ABSTRACT

Present-day shortcomings in information retrieval are
the results of a failure to properly contend with the problem of data
representation. The index provides the necessary linkage between a
multiplicity of sources and a single receiver. Whether considering
the source/document-space interface or the query/index interface, the
elements of the underlying communication phenomena are the same: sets
of documents, sets of attributes, and sets of relations expressing a
connection between documents and attributes. The essential operation
of the indexing system is the creation of a representation of the
document space. The analysis-document transformations and the
final index-query transformations are shown to be, respectively, a
prerequisite to, and function of, the document space representation.
The operating characteristics of the indexing system are modeled by
means of the index space. From a different point of view, the concepts
of error, organization, information and search are introduced through
a consideration of the indexing process as a thermodynamic system.
Thus, indexing is viewed as an order-increasing operation that
identifies common data elements and relations between data elements
present in the input document stream. (Author/MM)
CHAPTER I. INTRODUCTION



'But I should like to know...' Pippin began. 'Mercy,' cried
Gandalf. 'If the giving of information is to be the cure of
your inquisitiveness, I shall spend all the rest of my days
answering you.'

J. R. R. Tolkien, The Two Towers

Pippin is a Hobbit and it is well known, at least Tolkien tells us so,
that Hobbits are, by nature, a very inquisitive people. At the slightest
provocation they will produce a barrage of questions that will eventually
dull even the sharpest of minds. In the brief exchange quoted above,
Tolkien has identified (perhaps unknowingly) several important issues that
are worthy of consideration. Like Hobbits, we certainly would want to learn
more about the meaning of the following words and phrases: "I should like
to know," "information," "inquisitiveness," and "spend all the rest of my
days in answering you." Somehow the ideas these terms convey are all vaguely
familiar since we have all experienced the need to know something and,
sometimes, we have actually received answers to our questions. Perhaps the most
troublesome point rests with the concept of spending all the rest of one's
days in answering. Is it possible that "information" is indeed an unlimited
quantity--a resource beyond measure? Already we have started to ask questions.

Most authors begin a treatise dealing with topics in Information Storage
and Retrieval with an authoritative and somewhat threatening statement
concerning the "information explosion." Often, this term is also used as a
convenient catch-all phrase designed to suggest knowledge of the IS&R field.
We shall attempt to be somewhat more careful and confine our remarks to things
that we "know for sure." For instance, from research in cognitive dissonance
(see Weick [1]) it is known that the value placed on a message is directly
related to the magnitude of the effort required to understand it. Well, we
do know that people have worked hard at trying to understand the message
called information explosion, but as Hobbits, we would certainly want still
to ask several questions because we feel we know very little about the
information phenomenon.

From all appearances, our society has evolved to a state of dependence on
the recorded message. Thus, instead of dealing with actual experiences, we
manipulate facsimiles of them. Manipulation of such facsimiles is an activity
called information storage and retrieval. Bar-Hillel [2] provides a working
definition of the problem central to information storage and retrieval (IS&R):

Assuming that there exists somewhere a body of recorded knowledge--
in technical terms, a collection of documents--and assuming that
someone has a certain problem for the solution of which this
collection might contain pertinent material, how shall he decide
whether there are in fact documents in this collection that contain
such pertinent material, and, if so, how shall this material be
brought to his attention?

In this chapter we shall consider the nature of the process of "bringing
material to one's attention." We will find it helpful to ask questions
about what we seem to know about this process, the origins of the need for
IS&R and, finally, the fundamental nature of the main problem of IS&R.
Partial answers to these questions will be provided through a consideration
of both a schematic for the IS&R process and directions for further research.



materials opines:

"The library grew not only because classical works were bought, but because
of the [...] authors." This statement
accurately expresses present-day trends; but Bonnard refers, not to a growing
modern metropolitan library, but to the great literary expansion that took
place during the Hellenistic period of ancient Greece. People have been,
and will no doubt continue to be, concerned about the rapid proliferation of
documents. Obviously, the concept of an "information explosion" is not a
new one.

Another device that is frequently used to highlight the "information
explosion" is the figure depicting the exponential growth rate of the
literature of various disciplines. One source [4] has estimated that there
are recorded, in one form or another, 10 trillion alphanumeric characters.
Furthermore, this collection appears to be growing at a rate of about
10 billion characters every twelve years. Based on the estimates, it is
understandable that the number of scientific journals has grown to over
100,000 during the last 300 years [5]. These figures make it easy to envision
bleary-eyed researchers attempting diligently to read through a rapidly
growing mountain of reports and data. It is senseless to dispute any of the
data concerning the growth of documentation. Rather, let us consider briefly
some of the conditions present in both society and science that have caused
this "explosion."

There are, no doubt, numerous phenomena that, in some way, have contributed
to the growth of the store of recorded materials; however, we shall
restrict our attention to a brief outline of just six basic factors (adapted
in part from Mikhailov [6]):
• The shift from folklore to the written tradition.

In Medieval Europe, most history, literature and tradition was transmitted
from generation to generation by oral means (songs, tales, etc.). The
invention of the Gutenberg press brought on an almost complete reliance on
the printed word as the vehicle for communication. A collection of "records"
was no longer limited to the confines of human memory.

• The increased popularity and application of the scientific method.

Following the Renaissance, the scientific method became the dominant
philosophy of the Western World. The rapid growth of the various "sciences,"
the emphasis placed upon theory and the need for experimentation, have all
interacted to increase the amount of data that must be recorded, stored and
communicated.

• An increase in resources expended in discovery.

Written scientific communication has increased simply because of the increase
in the number of people involved in research. The expenditure of other
resources (as reflected in costs) has given rise to the need for numerous
"progress" and "justification" reports.

• The decrease in time lag between discovery and application.

This is a reflection of the rate of "progress" and, in terms of documentation,
means more papers, reports, patents, abstracts and other types of documents.

• The increased need for reliable decision-making information.

Complexity in science, government, industry and society in general creates a
need for "processed" data for decision-making.

• The increase in the amount of data resulting from scientific
experiments.

Because of the technology which research makes possible, individual
experiments can be made to yield extremely large amounts of recorded data.


From a brief consideration of these six factors one may conclude that
any problems associated with the storage and retrieval of data arise not as
a consequence of a sudden "information explosion" but as a consequence of a
normal growth process. Quantity is certainly one of the leading problems in
IS&R; every discipline is confronted with so many recorded documents (on
paper, magnetic tape, film and other media) that they defy organization,
manipulation and retrieval. However, it should be clear that problems that
we experience today are really a consequence of a prolonged history of
disorganization, a lack of planning and (of greatest importance) a fundamental
misunderstanding of the techniques of data organization. These problems are
only beginning to be dealt with in the field of IS&R.

Perhaps the toughest problem in IS&R centers on data representation. If
vast collections of data are to be used and used effectively (incorporated
into the processes of the scientific method and decision making), then they
must be amenable to accurate and complete searching. But the value to be
derived from accurately-represented and well-organized data hinges on the
assumption that, if the searcher is able to make effective use of these
collections, a costly duplication of effort can be avoided. Fugmann's [1]
estimate that almost 30% of the work done in chemistry is a duplication of
previous work suggests that there is much work still to be done in IS&R.
Furthermore one might ask what it would profit him to turn to data retrieval
systems for answers to his queries when 30% of the world's documents dealing
with phenothiazines (for instance) are misindexed [8] by such systems? As
a result of all of this, most researchers are perceptive and pragmatic about
their information problems. Their information gathering procedures follow
these steps (Mellon [9]):

... first, by inquiring of the individual who knows; second, by
performing the experimental investigations necessary to ascertain
the desired facts; and, third, by consulting the scientific
literature, where a record may be found of the published reports
of others' work on the subject in question.

In the face of all of these problems, scientists attempt to reduce the burdens
of communication by becoming more specialized. While specialization in
itself is useful and probably is a logical outgrowth of increased scientific
activity [10], it creates the possibility of intellectual isolation. This
means that the researcher may increasingly fail to become aware of significant
work, carried out in other disciplines, that may impinge directly upon his
own efforts. Thus Information Science, and more specifically IS&R, must not
only find ways of dealing with a growing collection of documents but, more
importantly, must find ways of overcoming the growing isolation of scientific
disciplines.

2. The Information Retrieval Process

Numerous solutions have been offered to the growing information transfer
problem. A consequence of these "solutions" is the current trend of
moving responsibility for effective scientific communication away from the
librarian and into the hands of systems designers. This trend has resulted in
the development of a profusion of information storage and retrieval systems,
each designed to solve the information transfer problem. These systems,
whether manual or automatic, all attempt to provide searchers with answers
to the question: "I should like to know..."

Information storage and retrieval systems can be described in the
terminology of Marschak [11] as a combination of two purposive processing chains.
These two chains are depicted in Figure 2.1; I have chosen to call them
document processing and retrieval processing. Both processes, which are
greatly simplified in this figure, actually involve multi-level and
multi-step operations designed to expedite the transfer of data. The document
processing chain shows the flow from document creation to document acquisition
(by the IS&R system), representation and document storage. The first three
stages are paralleled by the retrieval processing chain in the conception
(realization) of the information need, the clarification of the request, and
the representation (coding) of the request. Both chains share, through the
representation stage, the operations of selection, content analysis,
indexing and coding. Document storage involves, in addition, the process of
accumulation. The two processing chains merge at the searching operation
where retrieved data (potentially information) are disseminated for evaluation.
The dotted lines in the figure indicate the possibility of repeated cycling
through the retrieval process.
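
The two chains described above can be made concrete with a small sketch. It is offered only as an illustration under stated assumptions, not as the model developed in this dissertation: a "representation" is assumed to be a plain set of index terms, content analysis is reduced to word extraction against a small invented stop list, and the chains are assumed to merge by simple term overlap at search time. All names (represent, store, search, STOP_WORDS) and the sample documents are hypothetical.

```python
# Hypothetical sketch of the two processing chains of Figure 2.1.
# Assumption: a "representation" is a set of lower-case index terms.

STOP_WORDS = {"a", "an", "and", "of", "the", "in", "for", "to", "on"}  # assumed stop list


def represent(text):
    """Shared representation stage: selection, content analysis, indexing, coding."""
    words = (w.strip(".,;:?!\"'()").lower() for w in text.split())
    return frozenset(w for w in words if w and w not in STOP_WORDS)


def store(documents):
    """Document processing chain: acquisition -> representation -> storage."""
    return {doc_id: represent(text) for doc_id, text in documents.items()}


def search(index, request):
    """Retrieval processing chain: request -> representation -> searching."""
    query = represent(request)
    hits = [(len(query & terms), doc_id) for doc_id, terms in index.items()]
    # Rank by number of shared terms; drop documents that share none.
    ranked = sorted(hits, key=lambda h: (-h[0], h[1]))
    return [doc_id for score, doc_id in ranked if score > 0]


# Invented sample collection, for illustration only.
documents = {
    "d1": "Indexing of chemical literature",
    "d2": "Storage algorithms for magnetic tape files",
    "d3": "Evaluation of indexing and retrieval systems",
}
index = store(documents)
print(search(index, "indexing systems"))  # ['d3', 'd1']
```

Note that both chains pass through the same represent function; this mirrors the statement that the chains share the operations of selection, content analysis, indexing and coding. The repeated cycling indicated by the dotted lines would correspond, in this sketch, to a searcher reformulating the request and calling search again.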

The successful merging (in terms of answered questions) of these two
chains depends on the achievement of common understanding [12] between the
mechanics of the storage system and the actions of the searcher.
Operationally, common understanding is only made evident through the success of the data
searcher. But theoretically, common understanding can be evaluated in terms
of shared agreement about the worth of the product (the retrieved data)
and in terms of adherence to the rules of interaction. I shall discuss the
nature of this common understanding in subsequent chapters of this
dissertation.

As a final comment, one cannot help but be impressed by the diversity of
research efforts in and the copiousness of the literature of Information
Storage and Retrieval.* Almost every branch of mathematics has been
employed in an attempt to satisfactorily model the processes depicted in Figure
2.1. Generally, such efforts have been inconclusive. While quite a bit is
known about algorithms for data storage and file handling operations, little
is known about the proper techniques for document selection and representation.
Even less is known about the manner in which searchers go about and,
eventually, satisfy their information needs. In a general way, one may say that
workers in the field of IS&R are at present only promulgating a type of
professional folklore. Individual studies are difficult to comprehend, let
alone evaluate, in the absence of an underlying theory. Theory is a
prerequisite to the successful recording of a "tellable history."

* We shall not here attempt to review this large body of literature. The
reader is directed to Volumes 1-6 of the Annual Review of Information
Science and Technology [14] as a suitable starting point for such a review.

3. Directions

The major conclusion to be derived from this brief overview of the origins
and nature of the information problem is that we should stop being overly
concerned about the quantity of data produced by the sciences. Rather, our
attention should be directed toward improving the quality of the representation
of these data. It should be obvious that the sciences have not provided
the needed "ordering framework" for the representation and dissemination of
their myriad results. In fact, it could even be argued that many of the
problems that we must work to overcome are caused, as Fugmann [13] puts it,
by "...the lack of order that science has tolerated among its own results
from the very beginning." To overcome this lack of order, the field of
Information Storage and Retrieval must have as its goal the establishment
of the requisite ordering between the sciences. Graziano [15] makes this
a little more precise:

The proper concern of the science of documentation [IS&R] then may
be thought of as consisting of the operational methods of
identifying elements, distinguishing elements from each other, and for
transmitting sets of patterns from one time and/or place to another in
such a way so as not to destroy the power of the symbols to convey
exact concepts.

Throughout this chapter it has been implied that the operation of document
representation is crucial to the success of information retrieval. In fact,
the central theme of this dissertation is an analysis of the nature and role
of this representational activity. Indexing is identified as the prime
exemplar of this activity. It is believed that a comprehensive theory of
the indexing process will adequately serve to represent the nature of the
common understanding called information retrieval.

In Chapter 2 I will further consider the topic of the role of indexing
in IS&R processes. Attention will be directed to the form of a theory of
indexing, especially with respect to the generalized role of theory. Finally,
a statement will be made concerning the problem associated with present-day
indexing practices. In Chapter 3, by way of historical review, the sparse
literature of indexing theory will be briefly explored. Following this,
Chapter 4 presents the proposed indexing theory. The conceptualizations
derived from this theory will then be used in Chapter 5 for a reconsideration
of "information need," "inquisitiveness" and "relevance." Finally, in
Chapter 6 I summarize the previous chapters and comment on the possibilities
for applications and for future research.
CHAPTER II. INDEXING: ART, THEORY AND MODEL



Without a theory, however provisional or loosely formulated,
there is only a miscellany of observations, having no
significance either in themselves or over against the plenum
of fact from which they have been arbitrarily or accidentally
selected.

A. Kaplan, The Conduct of Inquiry

This chapter is designed to provide the supporting framework for a
statement of the problem central to the research the results of which are
reported in this dissertation. Accordingly, further attention will be
directed toward a consideration of the role of indexing in information
storage and retrieval processes. Data relevant to this topic will be
obtained through an analysis of alternate definitions of indexing and
through a consideration of the intrinsic importance of the indexing operation.
Some unanswered research questions will then be contrasted with present-day
indexing practices and guidelines. Finally, since this dissertation presents
a theory of indexing, special attention will be directed to an analysis of
the functions of theory, model and definition in the organization and
understanding of a miscellany of observations.

1. Research Trends

In Section 2 of the previous chapter I commented briefly on the
proliferation of research in the field of information storage and retrieval. I find
it difficult, if not impossible, to completely categorize all IS&R-related
studies. At best, only a general grouping can be effected. Taulbee [1], in
1967, identified [...] of investigation (or, shall we say,
activities) in IS&R. It is believed that this classification remains valid
with respect to present-day efforts:

• fundamental investigations (sentence parsing and associative
storage)

• reports of experiences with operating systems

• guidelines for system design and modification

• document relevance assessment

• the "how-to" for implementation

• bibliographies

At present, the activities that fall under the "fundamental" label include
the development of storage and retrieval algorithms, natural language
semantic and syntactic analysis, and the development of question-answering
systems [2]. Conspicuous by its absence from the above list is the
development of a cohesive theory of information storage and retrieval. Although
researchers often refer to information retrieval theory it appears that this
theory is an unstated (perhaps unformed) amalgamation of theories of specific
retrieval functions--e.g., logic, searching and storage techniques. The
same accusations can be leveled at the discipline of indexing. Markus [3],
in the early sixties, outlined areas of much needed research in indexing.
Some of the following were included: index format; index use patterns; the
teaching of indexing; increased indexing speed; equipment modification and
computer program development. It is again interesting to note that indexing
theory was not (and is still not) one of the areas mentioned.

It is correct to assume that Computer Science can be of utility in the
solution of the many information retrieval problems. However, it is not
correct to assume that such solutions can be effected essentially overnight.
Unfortunately, most of the computer-based storage and retrieval systems that
are in existence today are the result of the "urge" to apply the remedial
powers of the computer to the problem without full appreciation of the
problem being "remedied." Consequently, the following research and
development cycle is firmly established [4]: the need for computer-based retrieval
systems is felt; computer-based systems are created; the lack of suitable
evaluation criteria is felt; research is conducted on evaluation techniques;
new systems are built; and so on... I conclude that this cycle must be
broken if measurably effective progress is to be made. A theoretical basis
for IS&R must be developed that, for a given application (specific
information retrieval problem), will yield appropriate systems evaluation criteria.
Such a development is prerequisite to systems implementation. It is
believed that the elements of such a theory will emerge from a consideration of
indexing as viewed from the interdisciplinary philosophical framework of
Information Science.

2. Indexing as Art

The absence of a unifying theory for information storage and retrieval
(as for most of its component processes) is emphasized by the many divergent
definitions of indexing. Indeed, there appear to be as many definitions of
indexing as there are individual indexing applications and studies. A
limited sampling of these definitions includes the following conceptualizations:
a representation of content; a systematic guide to the content; an
identification tag; a product serving to point out, direct and guide; a
search access point; a dictionary of nomenclature; an association between
concepts and terms; a means of making information available. At this point
the reader may ask: 'But I should like to know...'. My only rejoinder,
when faced with so many definitions, is a terse conclusion: the confusion
between the definitions of the concepts of information, index, indexing,
retrieval and system (to name a few) must be resolved in the development of
a useful theory.

In addition to the number of definitions of indexing, the situation is
further complicated by the variety of types of documents indexed and of the
number of resulting indexes.* One wonders if there is not some formal
connection, or relationship, between these different types of indexes. This
question remains unanswered in the literature of IS&R. In present-day
indexing practices the analysis of document content and the resulting indexing
decisions are mainly treated as an artful practice. Mellon [5] emphasizes
this point: "The making of indexes is an art in itself, involving more than
a comprehensive knowledge of the general subject being covered, and the use
of indexes is no less an art." It is not surprising therefore that there
exists no comprehensive treatment of the process of indexing--one only finds
suggestions or examples of how indexing ought (in the opinion of the writer)
to be done. Even publications purporting to provide "indexing standards"
are really just promulgating "suggestions." Consider the statement of
[...] USA Standard Basic [...]

It [USA Standard] does not attempt to set standards for every
detail or for all the diverse techniques of indexing; these
should be determined for each index [...] the type
of material indexed and the type of user for whom it is [...]

* Author, subject, title, citation, patent number, formula, etc.





Furthermore, when indexing rules are provided the emphasis is on the
"cook-book" approach to indexing. Favorite topics include the standardization
of headings, the treatment of synonyms, cross-references, how to index
names, how to check index entries, etc. [12]. Such an "artistic" approach
only reflects the lack of an underlying theory of indexing. A theory of
indexing must be provided that emphasizes the importance and centrality of
the index operation in IS&R processes.

One of the primary goals of research in IS&R and in indexing is outlined
by Baxendale [13]:

...starting with a collection of 50,000 documents which covers four
subject fields, which is to grow at the rate of 2000 documents per
year, which is to be purged on the basis of activity, and which will
be subject to approximately 75% specific data requests and 25%
general queries, what type of indexing device will best accommodate
these conditions?

Before the development of such an indexing/retrieval-system nomogram can be
realized, attention will have to be directed, for lack of better terminology,
toward the identification and explication of first principles. It is quite
a conceptual distance between "cook-book" indexing and indexing-system
nomograms. The following are samples of things we will need to know more
about:

What is the nature of the selective transmission of data?

Who decides what data is to be passed on to the user?

[...] by indexers and analysts?

What is the function of index languages and devices?

How are index terms to be selected?

How is the quality of the index to be evaluated?

While several additional questions could be added to this list, those given
are sufficiently indicative of the type of problems that will have to be
considered in the development of indexing theory.



3. Theory and Model

In the previous discussion I have presented evidence in support of the
conclusion that an underlying theory is needed for the processes of
information storage and retrieval. I also conclude that a theory is needed
for the crucial representational process called indexing. Considering
the emphasis that is being placed upon the concept of theory, it is
appropriate, at this point, to briefly discuss the role of theory in the
sciences. This discussion is intended as a brief prologue to a consideration
of the role of indexing theory with respect to information storage and
retrieval. The interested reader is referred to Kaplan [14] for a detailed
discussion of the nature of theory and to Caws [15] for insightful
investigations into the nature of definition.

An area of investigation, or subject area, is composed of an assortment
of observables* variously labeled as data, knowledge, experience and fact
(we will defer a discussion of the validity of these labels to Chapters 4
and 5). The study of observables without a basic organizational framework
is judged to be of low utility. Consequently, theory is an attempt to
provide an organization for the set of observables or expected observables.
Emphasis is placed upon the unification, systematization and representation
of observables (see Figure 3.1). Properly formulated, theory is a
statement about the inherent structure of the observables. Thus, theory
simultaneously describes and analyzes the collection of observables. However,
as Kaplan points out [16], theory must provide more than a simple
description of observables:

A theory is more than a synopsis of rules ... It sets forth some
idea of the rules of the game by which moves become intelligible.

Theory is also expected to be evaluated in terms of its predictive ability.
Consequently, theory, as a linkage between observables and hypotheses, is
a guide to the collection and subsequent interpretation of data. The
structure of the theory is provided by a cohesive set of definitions about
the observables and about the relations that exist between observables. Caws
makes this clear [17]:

Ostensive definition [i.e., definition by example] is clearly not
enough. Moreover, a set of isolated statements about isolated
phenomena is not yet science; only when the terms in the statements
are related to one another does scientific theory emerge.

Generally, the order that is imposed on the observables by the theory is a
consequence of the order that exists between the component definitions.
The central role of definition in theory cannot be overemphasized.

* Things that come into our purview through our sensory mechanisms.

Figure 3.1: Theory and Model in Science



Some theories act as models. Frequently some theory about observables
may either be too difficult to construct, too difficult to understand or else
unsuitable for the symbolic manipulation of observables. Fortunately, one
theory may act as a model for another theory. The minimal condition for a
theory/model relationship between two theories is resemblance in form.
Thus, there is said to be a structural analogy or even isomorphism between
the theories. If theory A is easier to understand and to manipulate than
theory B, and if the theories are isomorphic, then the development of theory
A will serve as a model for theory B. Hopefully, an increased understanding
of B will result from this modeling activity. Finally, apart from being a
conceptual analogy, s good model will be the source of relevant hypotheses 



to be tested on the set of observables (that la, the theory will give rise 



indexing theory that is being proposed in this dissertation, I believe that, 
at our present state of knowledge, a comprehensive and workable theory of 
information storage and retrieval is unobtainable. Thus, we have labeled the 
theory an "unknown theory." Nevertheless, the Indexing theory that is 
presented In Chapter 4 is, I believe, a suitable working model for many of 
the processes of Information Storage and Retrieval in addition to its stand- 
ing Independently as a theory of tti?- processes of indexing and index creation. 
While it is not likely tliat Indexing theory is the theory of IS&R, it is 
believed that it provides a novel and useful interpretation of the associated 
observables, 

4. Statement of the Problem 

The previous chapters: have emphasized the necessity for research in 
information storage and retrieval. However, it is concluded that the most 
significant problem associated with current efforts is the lack of a useful 
theory of Information storage and 'retrieyal. Because of the abaence of such 
a theory, it is: difficult to properly evaluate present-day research and 
aystMis-deslgn efforts. But it would be foolish not to acknowledge the 



to e 




3.1 Indexing as a Model of IS&R

Figure 3.2 presents a schematic which illustrates the role of the



difficulty of formulating an adequate, all-inclusive theory of IS&R. So I
have devoted my attention primarily to the operation of indexing, working
under the assumption that indexing is the central and crucial operation for
the successful retrieval of information. Thus, it is believed that a theory
of indexing can serve to model the essence of the information storage and



As I have said, there is, at present, no comprehensive, unifying theory 
of indexing available for these applications. Repeatedly, indexing viewed 
as an "art" has failed to provide the necessary theory. Consequently, the 
problem is to develop a theory of indexing that satisfies two criteria:
first, it must provide the basis for the systematic analysis of both indexing
procedures and resulting indexes, and, second, it must provide a conceptual
basis for the evaluation of IS&R systems. It is toward these goals that the
research which led to this dissertation has been directed.
CHAPTER III. PREVIOUS INDEXING THEORIES



Very little advance in culture could be made, even by the greatest 
man of genius, if he were dependent for what knowledge he might 
acquire upon his own personal observations. Indeed, it might be 
said that exceptional mental ability involves a power to absorb 
the ideas of others, and even that the most original people are 
those who are able to borrow most freely, 

Libby 

Almost any study that is undertaken has a corpus of related and relevant 
literature that must be considered, and, the research reported here being no 
exception, this chapter contains a summary and critical evaluation of two 
previous attempts toward the formulation of a theory of indexing. It should
be noted that these early efforts of theory development were not continued
beyond their initial exposition in the late 1950's and the early 1960's.
Nevertheless, as we shall see, some valid comments were made about the
indexing process.

The material in this chapter is presented in three short parts: 1) an
examination of Jonker's [1-3] indexing-continuum theory, 2) an examination
of Heilprin's [4] model of indexing and, 3) a discussion of the questions
which these two studies left unanswered.

1. The Indexing Continuum 

By way of introduction, Jonker identified three central problem areas in
IS&R: 1) the indexing problem (the problem of document representation),
2) the coding problem (the problem of the conversion from a document
description to a machine-recognizable code) and, 3) the machine systems
problem (the problem of the selection of the most advantageous
code-processing system). He concluded that these three factors ultimately
reduce to the constraint of cost. Consequently, Jonker believed that the
goal of IS&R research (with respect to indexing) is to construct the most
economical indexing system, all the while providing the maximum depth of
indexing.*

From this reviewer's viewpoint, the effectiveness and the utility of an
indexing system are far more important than the cost. Cost cannot simply
be equated with dollar outlay. One might, rather, consider a comparison
between the cost of processing a document and the cost (loss of
utility) of the failure of the system to provide that document to a user.
I realize that a consideration of cost, in terms of dollars, is important
to the system designer, but cost should be relegated to a position of lesser
importance in the development of a theory of indexing.

Jonker does, however, focus on one of the central problems directly
associated with indexing:

The inescapable conclusion seems to be that no true understanding 
of existing indexing systems and problems seems possible, unless 
all systems can be seen in the light of more general common 
precepts, linking all those systems together into a closed single 
system. [5]

Thus, Jonker's theory of indexing is best described as an attempted taxonomy
for the classification of the various indexing systems. This taxonomy is
based on the belief that IS&R systems do not deal with items of information
themselves but, rather, with what he calls index-information. An
item of index-information is defined by a specification of both the "number
of terms" it contains and the connections between such terms.** Jonker
proposes that the characterization of index-information is provided by the
indexing continuum, which is a composite of the terminological continuum and
the connective continuum.

* Jonker uses the term "depth of indexing" as a synonym for the number
of index entries per document. This term usually refers to the
hierarchical specificity of the index entry, i.e., depth of detail.

1.1 The Terminological Continuum

An indexing system (by inference, a system that deals with index-information)
provides a meta-language for document description. Such an intensified
language provides a one-word name for every important concept (Jonker did not
discuss the nature of "important concept"). Consequently, the terms of the
language share a close relationship with both the symbol (document word) and
the meaning represented (Jonker did not define "meaning"). Figure 1.1.1 is
a representation of the continuum of intensified language terms, the
terminological continuum. Term size and the divisibility or the permutivity of
the term are assumed to increase as one moves to the right in the continuum.
Permutivity refers to the variability of representation. A particular
retrieval system is designed, as he saw it, by attempting to place the needs
of the average user on the terminological continuum.

1.2 The Connective Continuum 

The data contained in a single document may be utilized in a variety of
ways. Jonker defines diffuseness*** as the number of potential indexing
points for a given document. He claimed that it is the diffuseness of the
information (sic) which characterizes (and defines) the information problems
that exist today. Consequently, diffuseness of information is treated as
the controlling parameter in his generalized theory of indexing.

* The word "information" is occasionally so marked in this discussion to
remind the reader that the word "data" is meant, as I see it.

** To generalize, I interpret Jonker to mean that an item of index-information
is an n-term whose elements share a common connective relationship.

*** This can be interpreted as indexing "breadth."

Figure 1.1.1: The Terminological Continuum (from [1]). [Axis labels:
Letters, Words, Index Information; marked region: area of permutivity.]

Jonker believed that the number of index entries per document is directly
related to the number of indexing viewpoints that can be identified.
However, he gives no word on how one might go about choosing a set of
exhaustive viewpoints. It seems almost obvious that a document should be
represented in a form amenable to processing by any scheme of organization
and/or retrieval. Jonker nearly reached the same conclusion:

If properly indexed, an item of information is indexed by any keyword
that is or could possibly become of importance to any potential user
of the item of information. [6]

Apparently, improved indexing is achieved by selecting index terms from the
right of the terminological continuum. The resulting entries are
characterized by an increased number of words per term, which is believed to
be directly proportional to the degree of hierarchical connectedness of the
meta-language. Figure 1.2.1 shows the resulting connective continuum ranging
between short terms and long (multiple word) terms. Jonker believed that
short terms are representative of coordinate systems and long terms are
representative of hierarchically-based indexing systems. Thus, the connective
continuum characterizes indexing systems by their degree of generic character
(see Perry and Kent in [7]). Jonker associated hierarchical classification
with a low degree of diffuseness. Consequently, retrieval based on the short
end of the continuum is fluid and arbitrary, whereas the long end is
characterized by rigid retrieval. It is inferred that this is the essential
difference between complete independence and dependence of terms.

To summarize, the short end of the continuum is characterized by high
diffuseness, high permutability and low hierarchical connectedness, H. The
long end is characterized by low diffuseness and permutability, and high H.
Jonker concludes that the diffuseness and permutability taken together
determine the retrieval power, R, of the system. This is a measure of both
the accuracy with which information (sic) can be indexed and the detail by
which it can be retrieved. Jonker states that R · H = constant*; that is, an
increase in H can only be obtained at the expense of R, and conversely.

2. An "Intuitive" Mathematical Model of Indexing

Heilprin [4] attempted to provide a mathematical treatment of the general
theory of indexing developed by Jonker. His first step was to provide a
formalization of the concepts of diffuseness, permutivity and hierarchical
connectedness. For convenience of analysis, Heilprin chose to replace
(the average term length) by n (the mean number of independent terms per
stored item at point Z in the descriptive continuum). Figure 2.1 shows the
assumed inverse relationship between the two variables.

* Jonker overlooked the fact that the precision of the retrieval usually
increases with increased hierarchical definition. In other words, the
success of retrieval is not directly related to the number of permutations
available from a term.

Heilprin introduced the concept of a search path corresponding to the
number of "paths" from questions to documents. He contended, as I believe
rightfully, that the search method is independent of the number of available
paths; rather, search paths (or permutations) depend only on the index.
Furthermore, since most indexing systems do not permit full permutability,
Heilprin introduced a noise of permutability, N, which is an expression of
the deviation from the maximum. Maximum permutability is reasoned to be the
set of all permutations of n terms taken q (query terms) at a time, i.e.,
(n)_q. Similarly, Heilprin introduced a hierarchical noise, M, which represents
the discrepancy between the ideal number of hierarchical levels and the
number of levels by which a document is indexed. The following equations
were derived:

D = n

P = (1 - N)(n)_q

H = (1 - M)/n

Accordingly, Heilprin represented all possible indexing systems by a
3-space formed by the n, H and P values (see Figure 2.2). This is a
restatement of the assumed fundamental equation:

R·H = D·P·H = n·P·H
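
As a worked instance of Heilprin's three equations, the sketch below evaluates D, P and H for one set of values. It assumes that (n)_q denotes the number of permutations of n terms taken q at a time, n!/(n-q)!, that N and M are fractions between 0 and 1, and that R = D·P, which is one reading of the fundamental equation. The function name and the sample values are hypothetical and do not come from Heilprin's paper.

```python
from math import perm


def heilprin_values(n, q, N, M):
    """Evaluate D, P, H and R*H for one point on the descriptive continuum.

    n: mean number of independent terms per stored item
    q: number of query terms (q <= n)
    N: noise of permutability, assumed here to be a fraction in [0, 1]
    M: hierarchical noise, assumed here to be a fraction in [0, 1]
    """
    D = n                          # D = n
    P = (1 - N) * perm(n, q)       # P = (1 - N)(n)_q
    H = (1 - M) / n                # H = (1 - M)/n
    R = D * P                      # assumption: R = D*P, read off R*H = D*P*H
    return {"D": D, "P": P, "H": H, "R*H": R * H}


print(heilprin_values(n=5, q=2, N=0.25, M=0.5))
```

Under these assumptions, noise-free indexing (N = M = 0) gives P = (n)_q and H = 1/n, so the product R·H reduces to (n)_q.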

Clearly, for a single positioning on the descriptive continuum, there can be
various values of P and H depending on the independent, but confounding,
action of the N and M variables. This creates a family of curves all
generated by a single value of n. Heilprin contends that a complete family of
such curves would fill much of the index region. There will be as many index
curves and regions as there are values of n (recall that n is the value of
n at the short end of the continuum). Although this concept has not been
further developed, it seems to me that the size and relative positioning of
the index region could serve as an analytical representation of a given
indexing system.
3. A Brief Discussion of the Two Indexing Models

Although the presentation of the Jonker and Heilprin models has been very
brief, essential concepts, definitions and relationships which they proposed
have been presented. I hope the reader can, at least, gain an appreciation
of the general "tone" of their theory.

An important point relative to this brief review is that a theory has not
been presented. The mere enumeration of some of the components of the
indexing process* (and associated variables) does not answer such questions as
why index, what should an index provide, or what is the role of the index in
the process of information storage and retrieval or in human behavior in
general? However, considerable impetus for the creation of such a general
theory is provided by their presentations.

Jonker must be faulted for his overconcern for economy and cost factors;
this is not a constraint for a theory but, rather, just another variable to
be considered. In addition, it is believed that he has a fundamental
misconception of the concept of information. Does an index store information
or does it store data? At least some discussion on this point should precede
any general use of, or reliance on, the word "information." Just as the
definition of information is loose and non-precise, the descriptive continuum
must be criticized for lack of a quantitative measure. It is questionable
whether term length is a meaningful variable for a meaningful characterization
of an indexing system. Heilprin's use of the number of terms/item is more
realistic, but remains essentially unsupported. Indeed, it is not clear that
Heilprin's functional analysis (as presented) really adds anything new to
Jonker's theory. A formal presentation of informal concepts must retain an
element of uncertainty and informality.

* There exists an extensive and growing literature concerning the analysis
of indexing parameters. The usual assortment includes: the type of
classification scheme; the depth and breadth of indexing; the number of
terms per entry; the number of entries per document; the number of
documents per term; the indexing language used; the type of indexing aid
used (e.g., links, roles, weights). It is emphasized that the behavior
of these parameters is frequently analyzed for systems patterned after
existing ones. Frequently, the analysis of parameter behavior is modeled
by simulation studies (see [8] for a review of these studies).

In a more positive vein, both Jonker and Heilprin present concepts that
merit further consideration and development. Each concept surely will find
its place in a theory of indexing. I conclude this chapter with a listing
of these concepts:

• Indexing systems represented by a single closed system.

• An index entry represented as a term/relationship structure.

• Diffuseness of "information."

• Indexing structure dependence and independence.

• Noise must be accounted for in any valid theory.

• The term-query search path.

• The index region as representative of indexing systems.
CHAPTER IV. A THEORY OF INDEXING



The Master said, Yu, shall I tell you what knowledge is? 

When you know a thing, to know that you know it, and when 
you do not know a thing, to recognize that you do not know 
it. That is knowledge. 

Analects of Confucius (Waley's Translation)

An index is an array of symbols, systematically arranged,
together with a reference from each symbol to the physical
location of the item symbolized.

Mortimer Taube, Studies in Coordinate Indexing

1. Introduction

In Chapters I and II, I have emphasized that research in information
storage and retrieval has as its goal the discovery of solutions to the
problem of efficiently organizing man's expanding knowledge. Although a
variety of approaches have been applied in attempts to solve the problem,
the discipline suffers from the absence of any underlying model, models which
are fundamental to any well defined science.* It appears that much of the
effort to develop such models (see Chapter III), although well-intentioned,
is misdirected because there is little appreciation of the theoretical
foundations of information storage and retrieval. As a start toward
resolving some of these difficulties, the elements of a basis for a theory of
information storage and retrieval are set forth in this chapter. It is
hypothesized that the theory can best be formulated and expressed in terms
of a general theory of indexing.

* Recall the definitions of model and theory given in Chapter II.
In the first section of this chapter is stated the basic premise of the
theory, and a number of fundamental definitions are given. Following this
there is a discussion of the similarities between the indexing process and
the general communication process. Attention is then directed to the view
that indexing is an order-increasing operation, and some thermodynamic
notions are invoked to aid in this description. The concept of a
"theoretical index" is then elaborated and compared with real-world indexing
systems. Finally, the contribution of the human performance variable to the
efficacy of an indexing system is considered.

Just a note on organization. This chapter is divided into two parallel
parts, each of which contains nine sections. The first part provides a
concise exposition of the theory of indexing. The second part gives
supporting data and discussion related to the materials presented in the
first part of the chapter.

2. Some First Definitions and Postulates



It is assumed that any theory about processes in the real world must 
involve the operation of measurement and the specification of units. 
Accordingly, the concept of data element is postulated to be the fundamental
unit of documentation. The following four definitions are treated as 
antecedent to the definition of data element. 



Def. 2.1 Measurement:
    Measurement is the process of selecting
    among a set of possible alternatives exactly
    those which characterize the attribute under
    observation.

Def. 2.2 Attribute:
    An attribute is any discriminable feature of
    an event that is susceptible to some
    discriminable variation from event to event
    (Bruner [1]).

    or,

    An attribute is a subset of the set of all
    possible observables associated with an event.
Def. 2.3 Unit of Measure:
    A unit of measure is a metric which is defined
    by the function p: A × A → N (natural numbers),
    which assigns to each pair a,b ∈ A a
    non-negative real number p(a,b) and such that the
    following properties hold:

    1) p(a,b) = p(b,a) ∀ a,b

    2) p(a,b) = 0 iff a = b

    3) p(a,b) + p(b,c) ≥ p(a,c)
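
The three metric properties can be checked mechanically on any finite set of alternatives. The following is a minimal sketch, assuming the alternatives are plain numbers; `is_metric` and the two candidate functions are hypothetical names chosen for illustration.

```python
from itertools import product


def is_metric(points, p, tol=1e-12):
    """Check the properties of Def. 2.3 on a finite set of points."""
    for a, b in product(points, repeat=2):
        if p(a, b) < 0:                           # values must be non-negative
            return False
        if abs(p(a, b) - p(b, a)) > tol:          # 1) p(a,b) = p(b,a)
            return False
        if (p(a, b) == 0) != (a == b):            # 2) p(a,b) = 0 iff a = b
            return False
    for a, b, c in product(points, repeat=3):
        if p(a, b) + p(b, c) < p(a, c) - tol:     # 3) p(a,b) + p(b,c) >= p(a,c)
            return False
    return True


points = [0, 1, 3, 7]
print(is_metric(points, lambda a, b: abs(a - b)))     # absolute difference
print(is_metric(points, lambda a, b: (a - b) ** 2))   # squared difference
```

The absolute difference passes all three checks, while the squared difference fails property 3 (for 0, 1 and 3: 1 + 4 < 9), so only the first would qualify as a unit of measure in the sense of the definition.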



Def. 2.4 Precision:
    Precision is the number of alternative values
    for the result of the operation of measurement.

    or,

    Given a,b ∈ A, a metric p is more precise than
    a metric p' if p(a,b) < p'(a,b).



Thus, we now have:

Def. 2.5 Data Element:
    A data element, d, is the smallest thing which
    can be recognized as a discrete element of
    that class of things named by a specific
    attribute, for a given unit of measure with a
    given precision of measurement.



The following definitions build on the concept of data element:

Def. 2.6 Relation:
    Given sets of data elements d_1, d_2, d_3, ..., d_n
    (where d_k = {d_k1, d_k2, ...}, 1 ≤ k ≤ n), form the
    cross product d_1 × d_2 × d_3 × ... × d_n = ∏_{k=1}^{n} d_k.
    A relation, R, is a subset of this conjunctive
    set: R ⊆ ∏_{k=1}^{n} d_k.

Def. 2.7 Ordered Set:
    A set of data elements is said to be ordered
    by a relation R (over the data elements) if
    the relation is transitive and satisfies the
    trichotomy law (d_i R d_j or d_j R d_i or d_i = d_j
    for any data elements d_i, d_j of the set).
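
The notions of relation and ordered set can be made concrete with small finite sets. This is a minimal sketch, assuming data elements are ordinary Python values and a binary relation is stored as a set of pairs; the function names and sample sets are hypothetical.

```python
from itertools import product


def cross_product(*sets):
    """All tuples with one data element drawn from each set."""
    return set(product(*sets))


def orders(elements, R):
    """True if the binary relation R (a set of pairs) is transitive and
    satisfies the trichotomy law on `elements`."""
    transitive = all(
        (a, c) in R
        for (a, b) in R
        for (b2, c) in R
        if b == b2
    )
    trichotomy = all(
        (a, b) in R or (b, a) in R or a == b
        for a in elements
        for b in elements
    )
    return transitive and trichotomy


d1, d2 = {"a", "b"}, {1, 2}
R = {("a", 1), ("b", 2)}
print(R <= cross_product(d1, d2))   # a relation is a subset of the cross product

elements = {1, 2, 3}
less_than = {(a, b) for a in elements for b in elements if a < b}
print(orders(elements, less_than))
print(orders(elements, less_than - {(1, 3)}))
```

Removing the pair (1, 3) breaks both transitivity and trichotomy, so the last call reports that the set is no longer ordered by the relation.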
Def. 2.8 Well-Ordered Set:
    An ordered set of data elements is said to be
    well ordered if its every non-void subset has
    a first element.

Def. 2.9 Document:
    A document, D, is a well-ordered set of data
    elements.

Def. 2.10 Document Space:
    A document space is an ordered set of
    documents. This set is denoted by 𝒮.

Def. 2.11 Index Space:
    An index space, 𝒥, is a representation of
    the data elements, d, and relations, R, found
    in the indexing system (defined in Section 4).

Theorem 2.1:
    𝒥 is a document and 𝒥 ∈ 𝒮.

Def. 2.12 Index:
    An index, I, is the image* of composite
    order-preserving mappings performed on the
    document space 𝒮.

Theorem 2.2:
    I is a document.

Def. 2.13 Query:
    A query, Q, is a well-ordered set of data
    elements such that Q ⊂ I (cf. Def. 9.1).

Theorem 2.3:
    Q is a document.

Postulate 2.1:
    I = f(𝒮, 𝒥), where f is the indexing
    process (cf. Def. 4.1).

Postulate 2.2:
    Accurate retrieval depends upon the exactness
    of the indexing.
* Given a function f: S → T such that ∀ s ∈ S, f(s) ∈ T, we say that f(s) is
the image of the mapping defined by f.

3. Communication and Indexing

Information storage and retrieval is inherently a part of communication.
In fact, it can be argued that information storage and retrieval is central
to all of our activities. It is thus necessary to formalize the nature of
the ties between information storage and retrieval and communication.
Def. 3.1 Communication:
    Communication is a closed system consisting
    of an effector, a receptor, a transmission
    channel and a feedback unit (cf. Fig. 3.1).

Def. 3.2 Flow Rate:
    The communication or flow rate is measured
    in data elements per unit time.

Postulate 3.1:
    The items transferred from element to element
    in communication are data elements and
    associated relations.

Postulate 3.2:
    Any theory or practice of communication
    which causes a loss of data elements, either
    through their misrepresentation or by
    restricting their flow, must be considered
    inadequate.

Accordingly, it is assumed that accurate and effective communication is the
goal of an IS&R system. The following definitions consider the nature of
effective communication.



Def. 3.3 Experience Set:
    The source's or receiver's memory is modeled
    as an ordered set of data elements and
    relations. Denote the experience set by (ES).

Theorem 3.1:
    An experience set is a document.

Def. 3.4 Interface Experience Set:
    The interface experience set (IES) represents
    the data elements and relations that are used
    in the actual communication between the source
    and the receiver.

Def. 3.5 Effective Communication:
    Effective communication is obtained when the
    intersection of the source experience space,
    (ES)_s, and the receiver experience space, (ES)_r,
    is non-empty, i.e., (ES)_s ∩ (ES)_r ≠ ∅.

Theorem 3.2:
    Effective communication is maximal when
    (ES)_s = (ES)_r.
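
The intersection condition for effective communication is easy to illustrate with finite sets. The sketch below simplifies an experience set to an unordered set of data elements and ignores relations. The `overlap` measure is a hypothetical addition, not part of the theory, chosen only because it reaches its maximum of 1 exactly when the two sets coincide.

```python
def effective(es_source, es_receiver):
    """Communication is effective when the experience sets intersect."""
    return len(es_source & es_receiver) > 0


def overlap(es_source, es_receiver):
    """Hypothetical measure: shared elements as a fraction of all elements."""
    union = es_source | es_receiver
    return len(es_source & es_receiver) / len(union) if union else 0.0


author = {"index", "query", "document", "thesaurus"}
reader = {"index", "document", "catalog"}

print(effective(author, reader))
print(overlap(author, reader))
print(overlap(author, author))
```

Two parties with disjoint experience sets would give `effective(...)` equal to `False`, which corresponds to the case where no interface experience set can be formed.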

Def. 3.6 Experience Set Transformations:
    Experience set transformations are defined
    by sets S and R whose domains are (ES)_s and
    (ES)_r, respectively. These transformations
    have the following property:

        S·(ES)_s = (IES) = R·(ES)_r

    (see Figure 3.2).

Figure 3.1: A General Model of Communication.

Theorem 3.3:
    An IS&R-system user must have knowledge of
    the organization and representation of the
    data elements in the system to achieve
    effective communication with it.

Postulate 3.3:
    The indexing system (cf. Def. 4.3) provides
    the interface experience set and the
    transformations required for effective
    communication.

One of the transformation functions in the indexing process deals with the
order of the data elements that occur in the communication link. This
order-defining transformation is based on the definition of five exhaustive,
overlapping* classes of data-element relations:

Def. 3.7 Data-Element Relations:
    A data element relation is an element of the
    set of relations, REL = {E, G, P, F, T}, defined
    over sets of data elements d = {d_1, d_2, ...}
    and sets of attributes A = {a, b, c, ...}.

The relations comprising the set REL are defined as follows:

Def. 3.8 Equivalence Relation:
    An equivalence relation, E, satisfies the
    following properties:

        d_i E d_i                           (reflexivity)
        d_i E d_j ⇒ d_j E d_i               (symmetry)
        d_i E d_j & d_j E d_k ⇒ d_i E d_k   (transitivity)

* That is, a pair of data elements may be related by combinations of
these relations. For instance, we write aEFb to mean both relations
E and F operate on data elements a and b.
Def. 3.9 Generic-Specific Relation:
    The generic-specific relation, G, is defined
    by d_i "is generic to" d_j or, equivalently,
    by d_i > d_j. G is reflexive, transitive but
    not symmetric.

Def. 3.10 Part-Whole Relation:
    A part-whole relation, P, is defined by:
    d_i "is a part of" item X, or, equivalently,
    by d_i ∈ X. P is only reflexive.

Def. 3.11 Difference Relation:
    A difference relation, F, is defined by:
    d_i "is not equal to" d_j or, equivalently,
    by d_i ≠ d_j. F is symmetric and transitive.

Def. 3.12 Intensional Relation:
    An intensional relation, T, is defined by:
    d_i "is defined as" d_j where d_i is an item
    and d_j is a name. T is only transitive.
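
The properties attributed to the relations E and G can be tested on a small finite example. The sketch below is illustrative only: the data elements are hypothetical hierarchical class codes, E is modeled as "same top-level class" and G as "is a prefix of", neither of which is prescribed by the definitions above.

```python
def properties(elements, rel):
    """Report reflexivity, symmetry and transitivity of `rel` on `elements`."""
    return {
        "reflexive": all(rel(a, a) for a in elements),
        "symmetric": all(
            rel(b, a) for a in elements for b in elements if rel(a, b)
        ),
        "transitive": all(
            rel(a, c)
            for a in elements for b in elements for c in elements
            if rel(a, b) and rel(b, c)
        ),
    }


# Hypothetical data elements: hierarchical class codes.
codes = ["5", "51", "6", "62", "621"]


def E(a, b):
    """Stand-in equivalence: both codes fall under the same top-level class."""
    return a[0] == b[0]


def G(a, b):
    """Stand-in generic-specific relation: a is generic to b if a prefixes b."""
    return b.startswith(a)


print(properties(codes, E))
print(properties(codes, G))
```

On this example E comes out reflexive, symmetric and transitive, while G is reflexive and transitive but not symmetric ("6" is generic to "62", not the reverse), matching Defs. 3.8 and 3.9.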

Thus, the order-defining transformation, σ, is defined as follows:

Def. 3.13 Order-Defining Transformation:
    An order-defining transformation σ ∈ S
    (cf. Def. 3.6) is a mapping from strings of
    data elements into REL:

        σ(d_1, d_2, ..., d_n) → REL(d_1, d_2, ..., d_n).

Theorem 3.4:
    Transformation σ partitions D.

Theorem 3.5:
    Transformation σ partitions

Postulate 3.4:
    The transformation σ identifies patterns
    of data elements.

4. The Role and Position of the Indexing System in the Communication Process

In Section 3 a general definition of the term communication has been
given. In addition, some preliminary remarks have been made concerning the
nature of the indexing operation. At this point, the position of the index
in communication is viewed in terms of an adaptation of the Shannon-Weaver
generalized communication scheme [2]. Definitions are now presented to
characterize the nature of the transmission channel.

First, let us consider the definitions of indexing process, system
and indexing system.

Def. 4.1 Indexing Process:

The indexing process is characterized by the 
operations of identification (recognition) 
and representation of data elements and 
relations . 

Def. 4.2 System:
    A system is that portion of the universe
    chosen for observation and measurement.



Def. 4.3 Indexing System : 

An indexing system is a system for the 
application of the indexing process to the 
document space. The output from the index- 
ing system is the index. 

Now, we shall define the position of the indexing system in the 
communication process. 

Def. 4.4 The Location of the Indexing System in Communication : 

The indexing system is an intermediary 
between the transmission channel and the 
receiver. 

The indexing system is affected by noise. The output of the indexing system, 
the index, is viewed as intermediary between the channel and the receiver 
(see Figure 4.1). The input to the indexing system is characterized as a 
document stream. 

Def. 4.5 Document Stream : 

The input to the indexing system from the 
communication channel is called a document 
stream which is defined as a heterogeneous 
collection of apparently un-related documents, 
ordered by their time of arrival at the 
indexing system. 
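
A document stream ordered by time of arrival can be pictured with a few lines of code. In this minimal sketch the arrival times, the document names and the fixed-length sampling windows are all hypothetical, added only for illustration.

```python
from collections import defaultdict


def document_stream(arrivals):
    """Order (arrival_time, document) pairs by time of arrival."""
    return sorted(arrivals, key=lambda pair: pair[0])


def sample(stream, period):
    """Group the stream into consecutive windows of fixed length `period`."""
    windows = defaultdict(list)
    for arrival_time, document in stream:
        windows[int(arrival_time // period)].append(document)
    return dict(windows)


arrivals = [(7.5, "report B"), (0.4, "letter A"), (3.1, "memo C"), (4.9, "note D")]
stream = document_stream(arrivals)
print([document for _, document in stream])
print(sample(stream, period=5.0))
```

The documents in a window need not be related to one another; the stream only guarantees their arrival order, which is why the indexing system has to discover any relations itself.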

Figure 4.1: The Shannon-Weaver Model of Communication Adapted
to Include the Indexing System and the Index.
Note the Role of Noise and Feedback.

For convenience of definition we shall grant the indexing system the
ability to sample the document stream for fixed periods of time. Accordingly,

Def. 4.6 Input Time Slice:
    An input time slice is a section of the
    document stream corresponding to a fixed
    interval of time t, t ≪ T (where T is the
    time required to receive the entire document
    under consideration), that is isolated for
    observation and processing.

Postulate 4.1:
    The indexing system must recognize relations
    (from REL) between data elements both within
    and between time slices.

Theorem 4.1:
    The indexing system recognizes inter- and
    intra-document data element relations.

The role of indexing can now be defined:

Def. 4.6 The Role of Indexing:
    Indexing is a procedure for identifying
    relations that completely specify the flow
    of data in the document stream at any point
    in time.

Theorem 4.2:
    The indexing process is reversible: it must
    allow for the reconstruction of the original
    document flow.

Unfortunately, real-world indexing practices deviate considerably from the
effective structure of the indexing system described above. The following
postulate allows for the existence of error.

Postulate 4.2:
    Current indexing practices serve to obscure
    the unique organization between data elements
    in documents.

5. The Ordering Properties of the Index

In the first four sections of this overview of indexing theory, we
have considered successively the definition of some fundamental concepts,
the definition of communication, the nature of experience-set
transformations, types of relations applicable to document representation,
and, finally, the role and position of the indexing system in the
communication process. Attention is now directed to a further
characterization of the indexing system, considering especially the index as
a bi-directional interface between the document collection and the receiver.

A document has been described as an author-assembled, well-ordered
collection of data elements. It is inferred that these data become
information only when they are assimilated or put to use by the receiver(s).
Accordingly,

Def. 5.1 Information:
    Information is defined as data elements of
    value in decision making.

Clearly, data elements must be available at the proper time and in the proper
form to be of value in the decision-making process. To ensure accurate data
transfer, the indexing system must produce an index that is a facsimile of
the system's parent documents. Thus,

Theorem 5.1:
    Accurate and complete document representation
    is the function of the indexing system.

The indexing system draws on a bipartite document space to effect this
representation. The two components of the document space are defined as
follows:

Def. 5.2 Input Documents:
    Input documents, 𝒮_i, are documents which
    arrive at the indexing system via the
    transmission channel. These are the
    documents that the indexing system will
    represent.

Def. 5.3 Analysis Documents:
    Analysis documents, 𝒮_a, are documents which
    describe the transformations, S. These
    documents reside permanently in the system
    and are used as aids in the representation
    operation.

The document space can now be described.

Theorem 5.2:
    𝒮 = 𝒮_i ∪ 𝒮_a.
The representation of an input document by the indexing system can be
expressed as a set-product operation:

Theorem 5.3: ' Input-document representation = 

eS . ^ ^ * REL. 

1 a 1 



It follows as a consequence of definition 2.12 that. 

Theorem 5.4: I ~ 

The function, g, generates the index entries, where: 

Def. 5.4 Index Entry: An index entry i ∈ I is an expression such 
                      that the following data-element relation 
                      holds: 
                      d_j REL d_k, where for each j, ∃ {k} ∋ d_j REL d_k 
                      ∀ j, k. 

Figure 5.1 is a pictorial representation of these operations. It is 
interesting to note that this framework allows for a recursive definition 
of an index: an updated index is formed through a combination of the old 
index and the new elements of the document space: 

Theorem 5.5: ® ^ 

The operation of the indexing system is characterized by the index 
space, 𝒥. The following definitions are required for this characterization. 

Def. 5.5 Vocabulary: The vocabulary, V, is a set of possible 
                     data elements in a document space, ordered 
                     by precision of measurement. Subsets 
                     V_i ⊆ V of this continuum describe those 
                     data elements recognized by a particular 
                     indexing system. 

Def. 5.6 Transmission Decoding: 
                     Transmission decoding, TD, is a set of 
                     possible productions defining strings of 
                     data elements over V. Subsets TD_i ⊆ TD of 
                     this continuum describe those productions 
                     employed by a particular indexing system. 




Figure 5.1: The Indexing System. 

Def. 5.7 Language: Language, L, is a set of possible expressions 
                   (strings defined by TD plus relations from 
                   REL). Subsets L_i ⊆ L of this continuum 
                   describe the expressions employed by a 
                   particular indexing system. 

The index space is now alternatively defined as follows (cf. Def. 2.11): 

Def. 5.8 Index Space: The index space, 𝒥 = V × TD × L. 
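
As an illustration only, the set product of Def. 5.8 can be sketched in 
Python. The three small sets below are hypothetical stand-ins for the 
subsets of V, TD and L recognized by one particular indexing system, and 
the names index_space and is_expressible are assumptions introduced for 
this sketch; they are not part of the theory. 

```python
from itertools import product

# Hypothetical stand-ins for the three component sets of Def. 5.8.
V = ["isoprenaline", "practolol"]   # vocabulary subset
TD = ["word", "phrase"]             # transmission-decoding productions
L = ["uniterm", "multi-term"]       # language (forms of expression)


def index_space(vocabulary, decodings, language):
    """Return the set product V x TD x L as a list of coordinate triples."""
    return list(product(vocabulary, decodings, language))


def is_expressible(request, space):
    """A request can be formulated only if it is an element of the space."""
    return request in space


space = index_space(V, TD, L)
print(len(space))  # 2 * 2 * 2 coordinate triples
```

Under these assumptions a request built from a data element outside V 
(for example "asthma") is not an element of the product and so cannot be 
formulated against this index. 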

The concept of an index space provides a useful framework for analyzing the 
retrieval process. A specific request initiated by the receiver must be 
formulated as an element of the index space: 

Theorem 5.6: Q ∈ 𝒥. 



Thus, 

Def. 5.9 The Process of Retrieval: 
              A homomorphic mapping of the request data 
              elements and relations into the index space. 

Consequently, we have the following homomorphic mappings: 

Theorem 5.7: Q 1 , 

1 

Corollary 5.1: There exist as many homomorphic mappings 
               (Q → 𝒥) as there exist individual receivers 
               in communication during a specified time 
               interval. 

Corollary 5.2: I is a bi-directional interface between Q 
               and 𝒟. 

6. Indexing as an Entropy-Reducing Operation 

We now consider an alternative way of characterizing the operation of 
the indexing system, namely, that the indexing system increases the order 
of the data elements in the document space. More explicitly, the 
specification of a structure upon which measurement is effected yields a 
reduction in thermodynamic entropy by increasing the intrinsic order of the 
system under study. The imposition of an explicit order (i.e., order 
relations selected from REL) upon the elements of the structure also amounts 
to a decrease in communication entropy. For the moment we shall content 
ourselves with a fuzzy definition of entropy (the reader is referred to the 
parallel section 15 for an overview of the alternate definitions of 
"entropy"): 



Def. 6.1 Entropy: "... a measure of the lack of information 
                  about the actual structure of the system." 
                  (Brillouin [3]) 

             or,  A measure of the incompleteness of the data 
                  from which we infer the state of the system. 

Documents that arrive at the indexing system are (ignoring chronology) 
in a highly disordered state because there exist no overt data-element 
connections across document boundaries. Accordingly, 

Postulate 6.1: The indexing system recognizes and makes 

explicit inter-document data-element 
relationships . 



The indexing system, in its organization and recognition operations, defines 
a "phase space" of data elements intermediary between the document space and 
the receiver. Thus, 



Def. 6.2 Phase Space : A phase space is a definition of the 

accuracy of measurements based on the 
division of the document space into well 
defined units. 

or, The specification of two document coordinates 

a) configurational coordinates that 
depict which data are stored, and 

b) momentum coordinates that determine the 
particular sequence of configuration 
coordinates involved in the document 
representation . 



The two coordinates of the phase space describe the storage and search 
operations associated with the use of the index. Consequently, 

Theorem 6.1: The phase space is isomorphic with the 
             index space, 𝒥. 

The ordering of data elements, by means of a phase space, amounts to 
a reduction in entropy: 

Theorem 6.2: The order-preserving and increasing 

properties of an indexing system amount to 
a reduction in the entropy of the document- 
space/document-space-searcher system. 

Since such a reduction in entropy must be accompanied by an increase in 
entropy (i.e., by an expenditure of energy) elsewhere in the system, we have: 

Theorem 6.3: The entropy decrease which results from the 

creation of the index, is balanced by the 
entropy increase associated with the effort 
needed to obtain the coordinates of data 
elements in the phase space. 

Finally, we postulate a relationship between the work expended in indexing 
(the specification of phase space coordinates) and the information desired: 

Postulate 6.2: The probability of a given set of data 

elements becoming information is a function 
of the work expended by the indexing system. 

7. The Concept of Benefit 

We have so far been concerned with the recognition and representation 
of data by the indexing system. It has been emphasized that data must be 
in the proper form and must be available at the proper time to be of use to 
the receiver (decision maker) . When the conditions of form and availability 
are fulfilled, we say that the data becomes information. However, there 
remain the questions: What is data of value? and How is the searcher to 
benefit from the existence of such information? The answers to these 
questions are found in a consideration of the concepts of goal, hypothesis 
testing and decision making. 

It is assumed that a goal represents a desired end product or end state 
of the receiver. A goal may be as simple as "the retrieval of any document 
on subject X" or as complex as "the winning of a game of chess." Thus, 

Def. 7.1 Data of Value: Data are of value when they are used in 
                        the accomplishment of a goal. 

In the retrieval process, the goal of the searcher is achieved through a 
hypothesis-testing and decision-making chain. Hypotheses are posed by the 
receiver concerning the data store (e.g., concerning the contents of the 
document space) and the retrieved data may provide information leading to 
the decision which results in goal achievement. Figure 7.1 shows a structure 
of possible goal-directed paths of which we define two extreme cases: 

Def. 7.2 Path of Maximum Benefit: 
              The H - D - G path is the path of maximum 
              benefit where H = hypothesis, D = decision 
              and G = goal. 

Def. 7.3 Path of Minimum Benefit: 
              The hypothesis path, denoted by H - H - H ..., 
              is the path of minimum benefit. 

Clearly a decision must be made in the minimum benefit case to 
formulate a new hypothesis based on the data retrieved in support of the 
previous hypothesis. But this decision will be treated as less significant 
than the goal-achievement decision associated with definition 7.2. Thus, 

Def. 7.4 Meta-Decision: A meta-decision is a decision which does not 
                        lead directly to goal achievement (frequently 
                        associated with the progression between 
                        hypotheses). 

and, in addition: 

Def. 7.5 Meta-Information: 
              Meta-information is data elements of value 
              in meta-decision making. 





Clearly, 

Theorem 7.1: Information is not equivalent to meta- 
             information. 

Figure 7.1: The Hypothesis-testing and Decision-making 
            Chain. The double-bonded path represents the 
            minimum benefit case; the enclosed path 
            represents the path of maximum benefit. 
            (H = hypothesis, D = decision, G = goal) 

At this point we are better equipped to define the concept of benefit: 

Def. 7.6 Benefit: Benefit is a relationship between the 
                  information obtained and the number of 
                  decisions required to reach a goal. 

In addition, benefit, B, obtained from indexing is accumulated over a 
considerable interval of time (measured at times t_i), thus: 

Theorem 7.2: B = Σ_{i=1}^{N} B_i, 

             where B_i is benefit measured at time t_i and 
             ∃ N ∋ B_i = 0 ∀ i > N. 

8. Theoretical vs. Real-World Indexes 



Indexing systems are, by definition, imperfect because the associated 
ordering measurements (classification) are inherently uncertain. The 
Heisenberg uncertainty principle [4] applies to the specification of 
elements of the indexing phase space. Thus, there is always the possibility 
of misinterpretation (misrepresentation) of data elements. Clearly, then, 
there is a certain amount of "noise" or "error" built into an indexing 
system because of the inherent limitations of the associated classification 
method. 

Based on the premise that indexing systems are imperfect, one must be 
able to distinguish between "perfect" and "imperfect" indexing systems. 
This distinction is sharpened through the definition of theoretical and 
real-world indexes. Thus: 

Def. 8.1 Theoretical Index: 
              The theoretical index represents all inter- 
              and intra-document relations between data 
              elements in the document space. Order- 
              preserving operations are employed at all 
              steps of the indexing process. 

If we assume that there are d data elements in a given document of the 
collection, then the theoretical index must permit at most the existence 
of 2^d connections between these data elements. But, since there are many 
documents (say m of them) in a given document space, one must allow for 
the existence of many more data-element relationships. 

Theorem 8.1: The theoretical index must be able to 
             represent any subset of 
             {2^(d_1), ..., 2^(d_m)} data-element relations. 
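
The count of 2^d can be checked by enumeration. The sketch below assumes, 
for illustration only, that a "connection" is identified with a subset of 
the d data elements of one document; the element names, the list of 
document sizes and the function name connections are all hypothetical. 

```python
from itertools import combinations


def connections(elements):
    """Enumerate every subset of the given data elements (empty set included)."""
    found = []
    for size in range(len(elements) + 1):
        found.extend(combinations(elements, size))
    return found


# Hypothetical document with d = 3 data elements.
document = ["practolol", "isoprenaline", "asthma"]
subsets = connections(document)
print(len(subsets) == 2 ** len(document))

# For m documents with d_1, ..., d_m data elements there is one bound each.
sizes = [3, 5, 2]
bounds = [2 ** d for d in sizes]
```

The bound grows exponentially with d, which is one way of seeing why a 
theoretical index serves as a standard of comparison rather than as 
something to be built. 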

The real-world index is now defined: 

Def. 8.2 Real-World Index: 
              Real-world indexes contain, for a given 
              document space, a number of valid index 
              entries (cf. Def. 5.4), N_R, such that 
              N_R << N_T, where N_T is the number of valid 
              index entries contained in the theoretical 
              index for the same document space. 

Thus, 

Postulate 8.1: For a given 𝒟, real-world indexes fall short 
               of the theoretical index because the indexing 
               of the document space is incomplete (cf. 
               Postulate 4.2). 

9. The Human Limitation 

The presentation, up to now, has been concerned with a systematization 
of information storage and retrieval by means of a theory of indexing. 
This theory rests essentially on a formalization of the notions of data 
element and relations. However, the implementation and modeling of an 
information storage and retrieval system are not simply abstract constructs, 
but are engineering processes involving the human factor. This section will 
consider the nature of the interface between the indexing system (the index) 
and the receiver, and show that the abstract construct of an ideal index 
must be tempered by a fuzzy theory of human query-formulation and 
decision-making processes. 

Very rarely is the receiver, who utilizes a real-world index, 
completely satisfied with the result of an initial query. Either he has 
an incomplete understanding of the organization of the system or he is 
unable to adequately formulate a hypothesis about its contents. The 
following definition of query is an extension of Definition 2.13. 

Def. 9.1 Query: A query is a hypothesis about the contents 
                of the document space, 𝒟 (cf. Def. 2.13). 

Postulate 9.1: The maximal and minimal paths (Def. 7.2 and 

7.3) of inquiry have a small probability of 
occurrence. 



Consequently, 

Theorem 9.1: The first data element retrieved, in response 
             to an initial query, is likely to be only 
             partially beneficial (cf. Def. 2.13). 

Corollary 9.1: Benefit can only be maximized through repeated 
               interaction between the receiver and the 
               index. 



In the maximal benefit case (Def. 7.2), the data element that provides the 
information is said to have maximal utility or value. However, in any 
intermediary case, the utility of an information-providing data element is 
decreased because the data element is one of a sequence of retrieved data 



elements. Thus, 

Def. 9.2 Decay: A data element received at time t_2 has 
                lower value than the same data element 
                received at the time t_1. (Here time is 
                measured relative to the start of the 
                interaction between receiver and index.) This 
                decrease in value is called decay. 

Theorem 9.2: The decrease in utility of a data element 
             is directly related to its position in a 
             string of retrieved data elements. 

Corollary 9.2: The utility of any data element decreases 
               with the number of hypothesis-testing and 
               decision-making steps which precede its 
               retrieval. 

Corollary 9.3: The utility of the last data element used 
               to reach a goal is a function of the benefits 
               derived from the use of the previous data 
               elements. 



It is postulated that data elements exhibit a Poisson-like behavior 
in their role in decision making. Consequently, the value of any data 
element in decision making diminishes with time (see Figure 9.1). However, 
as the hypothesis becomes more specific, the rate of loss of utility of a 
data element also decreases (see Figure 9.2). 

Postulate 9.2: The value (utility) of a data element, with 

respect to goal achievement, is Poisson 
distributed. 

Postulate 9.3: Data elements are indistinguishable with 

respect to their value distribution: 

value (d)^ = value (d^)^ 

n n 

and, finally 

Postulate 9.4: The rate of decay of the utility of newly 

retrieved data elements decreases with increas- 
ing path length in the H - D - G structure 
(Figure 7.1). 
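
For reference, the Poisson probability mass function invoked in Postulate 
9.2 can be written out. The sketch assumes that k counts the 
hypothesis-testing and decision-making steps that precede the retrieval of 
a data element and that lam is a hypothetical mean; neither identification 
is fixed by the postulates, and the function name is introduced here only 
for illustration. 

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability P(k; lam) = lam**k * exp(-lam) / k! for k = 0, 1, ..."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


# Hypothetical mean of one step; the postulates do not give a value.
lam = 1.0
values = [poisson_pmf(k, lam) for k in range(6)]
for k, v in enumerate(values):
    print(k, round(v, 4))
```

With this choice of mean the probabilities do not increase with k, which 
is the qualitative shape of a value that diminishes over successive 
interactions; a different mean would place the peak elsewhere. 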

Figure 9.1: Data-element-value Distribution over Time. 

Figure 9.2: Data-element-value Decay as a Function of 
            Successive Interactions Between User and 
            Index. 

10. Interregnum 

To R. L. Collison [5], "The trouble with indexing is that even today 
we are still at the elementary stage of learning how to do it. We do not 
know enough about its technique" and we certainly do not know enough about 
its theory. Indexing, and its associated paraphernalia, constitute a 
strange process. Consequently, the researcher is confronted by an 
interesting situation: on the one hand examples of the product of indexing, 
the index, are plentiful and ubiquitous; on the other hand, attempts to 
formalize either the process of indexing or the relationship between its 
exemplars are virtually nonexistent. The previous nine sections of this 
chapter constitute an attempt to remedy this situation by presenting a 
formal description (and interpretation) of the "indexing process". 

The exposition of the theory was designed to be brief and terse; 
consequently, a summary is not easily presented. As a form of summary, the 
postulates presented in the previous sections are listed, as a group, as 
being indicative of the scope of the theory presented. 

2.1: I = f(𝒟, 𝒥), where f is the indexing process. 

2.2: Accurate retrieval depends upon the exactness of the indexing. 

3.1: The items transferred from element to element in communication 
     are data elements and associated relations. 

3.2: Any theory or practice of communication which causes a loss of 
     data elements, either through their misrepresentation or by 
     restricting their flow, must be considered inadequate. 

3.3: The indexing system provides the interface experience set and 
     the transformations required for effective communication. 

3.4: The transformation identifies patterns of data elements. 

4.1: The indexing system recognizes inter- and intra-document data 
     element relations. 

4.2: Current indexing practices serve to obscure the unique 
     organization between data elements in documents. 

6.1: The indexing system recognizes and makes explicit inter- 
     document data-element relationships. 

6.2: The probability of a given set of data elements becoming 
     information is a function of the work expended by the 
     indexing system. 

8.1: For a given 𝒟, real-world indexes fall short of the 
     theoretical index because the indexing of the document 
     space is incomplete. 

9.1: The maximal and minimal paths of inquiry have a small 
     probability of occurrence. 

9.2: The rate of loss of utility of a data element is Poisson 
     distributed. 

9.3: Data elements are indistinguishable with respect to their 
     value distribution. 

9.4: The rate of decay of the utility of newly retrieved data 
     elements decreases with increasing path length in the 
     H - D - G structure. 

For further clarification the reader is referred to Figure 10.1. This 
figure presents an overview of the various schema associated with the 
indexing theory. The conceptual steps that lead from generalized 
communication to the characterization of the indexing system are depicted. 
The final level of analysis, the index space, 𝒥, is interpreted as a 
representation of the operating limits of the indexing system. 

It is interesting to note that the discussion of the previous sections 
has been predicated upon the existence of three conceptual classes: sets 
of documents, sets of attributes, and sets of relationships expressing a 
connection between documents and attributes. These are the fundamental 
entities of any IS&R system and must be incorporated in the characterization 
of effective communication. The ideal index has been chosen as the 
standard for effective indexing. Albeit unobtainable, the ideal index 
serves as a useful comparative device. By analogy, the ideal index operates 
in a manner similar to the ideal game player (adapted from Garfinkel [6]): 

He never overlooks a message; he extracts from the message all the data it 
bears; he names things properly and in the proper form; he never forgets; 
he stores and recalls without distortion; he never acts on principle but only 
on the basis of an assessment of the consequences of a line of conduct for the 
problem of maximizing the chances of the effect he seeks. 

A theory of indexing must obviously account for error but, more 
importantly, it must provide guidelines for the maximization of 
document-representation fidelity. The next eight sections (each one 
parallel to a section of the overview part of this Chapter) will present 
arguments for and a further exposition of the indexing theory. The goal is 
to at least partially establish isomorphism between real-world indexing 
practices and the interpretations of these practices embodied in the theory. 

11. Data Element as the Basis 

The theory of indexing that has been presented in the previous sections 
has relied heavily on the concept of data element. It has been assumed that 
data element is the fundamental unit of documentation and, accordingly, 
provides the basis for many of the concepts and relationships developed in 
the theory. Following Sorgel [7] (who was concerned with the concept of 
keyword), three important features of the concept of data element can be 
identified: 

1) The concept of data element allows for independent manipulation. 

2) A data element does not decompose into two or more units. 

3) A data element has a definite meaning or interpretation. 

These features were incorporated (albeit implicitly) into the definition of 
data element (cf. Def. 2.5) and were viewed as consequent to the definitions 
of measurement, attribute, unit of measure and precision. The presentation 
in Section 2 began, rather abruptly, with the definitions of measurement 
and attribute. As an alternative, and to counter a possible objection that 
these first definitions were "pulled from the air," we shall consider a 
more formal development of the concept of measurement, a concept that is 
antecedent to data element. 

Before undertaking the further development of data element, let us 
introduce a document by means of which the various concepts discussed may 
be exemplified. This document is shown in Figure 11.1 together with 
examples of index entries involving differing definitions of data element. 
Many alternative derivatives of this document appear throughout the 
remainder of this chapter. 
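
One of the index-entry forms of Figure 11.1, the key-word-in-context 
entry, can be sketched as follows. This is a minimal illustration under 
stated assumptions, not the procedure used to prepare the figure: the 
function name kwic, the fixed context width and the chosen keyword list 
are all hypothetical, and only the title of the example document is used. 

```python
def kwic(title, keywords, width=30):
    """Return sorted (keyword, line) pairs, each keyword shown in its context."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        key = word.strip(".,()").upper()
        if key in keywords:
            # Keep at most `width` characters of context on either side.
            left = " ".join(words[:i])[-width:]
            right = " ".join(words[i + 1:])[:width]
            entries.append((key, f"{left:>{width}} {key} {right}"))
    return sorted(entries)


title = ("EFFECT OF A SELECTIVE BETA-ADRENERGIC BLOCKER IN PREVENTING "
         "FALLS IN ARTERIAL OXYGEN TENSION FOLLOWING ISOPRENALINE IN "
         "ASTHMATIC SUBJECTS.")
entries = kwic(title, {"ARTERIAL", "ISOPRENALINE", "ASTHMATIC"})
for key, line in entries:
    print(line)
```

Here the data element is taken to be a word; choosing a different data 
element (a phrase, say) would yield a different family of entries from 
the same document. 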

Let us adopt an essentially mechanistic view of the world and consider 
that all events (the word "event" is left to the reader to define) are the 
outputs of machines. Accordingly, 

Machine: A machine is a black box which accepts 
         inputs and emits outputs (see Figure 11.2). 

Thus an output, or event, is somehow paired with an input by means of a 
"black box." Although such a definition is all inclusive, it offers little 
in a descriptive sense. For increased specificity the following definition 
incorporates a theory of the operation of the black box: 

Turing Machine: A turing machine, Tm, is denoted by 
                Tm = {K, Γ, δ, θ, F, q} where: 
                K is a finite set of states; 
                Γ is the finite set of symbols from which 
                the inputs and outputs are obtained; 
                δ: K × Γ → K is the next state function; 
                θ: K × Γ → Γ is the output function; 
                F ⊆ K is the set of final states; and 
                q ∈ K is the start state. 
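
The six components of this definition can be exercised with a short 
sketch. It is only one possible reading: because the definition gives a 
next state function and an output function on pairs of state and symbol, 
and no tape movement, the sketch simply steps through the input one symbol 
at a time. The class name Machine, its field names and the parity example 
are assumptions introduced for illustration. 

```python
from dataclasses import dataclass


@dataclass
class Machine:
    states: set    # K, a finite set of states
    symbols: set   # finite set of input and output symbols
    delta: dict    # next state function: (state, symbol) -> state
    output: dict   # output function: (state, symbol) -> symbol
    finals: set    # set of final states, a subset of K
    start: str     # start state, an element of K

    def run(self, inputs):
        """Pair each input symbol with an output; report if a final state is reached."""
        state, outputs = self.start, []
        for symbol in inputs:
            outputs.append(self.output[(state, symbol)])
            state = self.delta[(state, symbol)]
        return outputs, state in self.finals


# Hypothetical example: emit the running parity of the 1s read so far.
parity = Machine(
    states={"even", "odd"},
    symbols={"0", "1"},
    delta={("even", "0"): "even", ("even", "1"): "odd",
           ("odd", "0"): "odd", ("odd", "1"): "even"},
    output={("even", "0"): "0", ("even", "1"): "1",
            ("odd", "0"): "1", ("odd", "1"): "0"},
    finals={"odd"},
    start="even",
)
outputs, accepted = parity.run("1101")
print(outputs, accepted)
```

Every emitted symbol is drawn from the same finite symbol set as the 
inputs, which is the property the definitions that follow rely upon. 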

The next three definitions arise immediately from that of turing machine: 

Observables: Observables are elements of Γ. 

Lancet 

820  EFFECT OF A SELECTIVE BETA-ADRENERGIC BLOCKER IN 
     PREVENTING FALLS IN ARTERIAL OXYGEN TENSION FOLLOWING 
     ISOPRENALINE IN ASTHMATIC SUBJECTS. 
     LANCAO.2.7630.69.1092-3 
     Palmer KNV. Legge JS. Hamilton WFD. Diament ML. Dep. Med., Univ. 
     Aberdeen, Aberdeen, Scot. 

     6673354 4-(2-Hydroxy-3-isopropylaminopropoxy)acetanilide (practolol) 
     (20 mg/subject, i.v.), a β-ADRENERGIC BLOCKING agent 
     selective to the HEART, prevented the decrease in ARTERIAL 
     OXYGEN TENSION in 11 ASTHMATIC patients following 
     7683592 isoprenaline (0.1 mg/subject, aerosol inhalation) treatment 
     without significantly decreasing the BRONCHODILATOR action of 
     isoprenaline. 

     7683592 

     p-(MeCONH)C6H4OCH2CH(OH)CH2NHPr-iso 
     6673354 

Figure 11.1: An Example Document 
             [CBAC, vol. 11(2), 1970, p. 119] 
             (Reproduced with the permission of the 
             Chemical Abstracts Service, Columbus, Ohio) 

LANCET, 2, 7630, 69, 1092-3. 
1069 tokens 
408 types 

Index Entry (KWIC)        Frequency in text 

Acetanilide                1 
Adrenergic                 2 
Aerosol                    1 
Arterial                   7 
Asthmatic                  4 
Blocker                    1 
Bronchodilator             2 
Heart                      1 
Hydroxy                    1 
Inhalation                 1 
Isoprenaline              14 
Isopropylaminopropoxy      1 
Practolol                  9 
Tension                    3 

Figure 11.1 (cont.): Frequencies in Original Text. 

Document 
     Molecular formula entries 
     Faceted entries 
     Uniterm entries 
     Multi-term entries 
     Articulated entries 
     Key-word-in-context entries 
     Author entries 

Where the following are examples of the index-entry 
transformation, g: 

g_a: Molecular formula entries 

     820 
     820 

g_b: Faceted entries 

     820  Receptor, beta(1) 
     820  Receptor, beta(2) 
     820  Drug, beta-blocking 
     820  Muscle, Bronchial 
     820  Asthma, Bronchial 

Figure 11.1 (cont.): Example Indexes. 

g_c: Uniterm entries 

     820  Isoprenaline 
     820  Receptors 
     820  Drug 
     820  Myocardium 
     820  Myocardial 
     820  Contractility 
     820  Oxygen 
     820  Tension 

g_d: Multi-term entries 

     820  Beta-blocking Drug 
     820  Myocardial Contractility 
     820  Bronchial muscle 
     820  Oxygen tension 
     820  Bronchodilator activity 
     820  Blood-gas tension 
     820  Bronchial Asthma 

g_e: Articulated entries 

     820  beta-adrenergic blocker 
          • effect of, in preventing falls 
            following isoprenaline in asthmatic subjects 
          • in preventing falls following isoprenaline 
            in asthmatic subjects, effect of 

     820  asthmatic subjects, 
          • effect of beta-adrenergic blocker in 
            preventing falls in, following isoprenaline 
          • following isoprenaline, effect of 
            beta-adrenergic blocker in preventing falls in 

Figure 11.1 (cont.): Example Indexes. 

g_f: Key-word-in-context entries 

     sopropylaminopropoxy) ACETANILIDE (practolol) (20 mg/subject, i.v. 
      OF A SELECTIVE BETA- ADRENERGIC BLOCKER IN PREVENTING FALL 
      /subject, i.v.) a β- ADRENERGIC BLOCKING agent selective to the 
          (0.1 mg/subject, AEROSOL inhalation) treatment without signifi 
     N PREVENTING FALLS IN ARTERIAL OXYGEN TENSION FOLLOWING ISO 
           the decrease in ARTERIAL OXYGEN TENSION IN 11 ASTHMAT 
      OXYGEN TENSION in 11 ASTHMATIC patients following isoprenaline (0 
     OWING ISOPRENALINE IN ASTHMATIC SUBJECTS. 
     CTIVE BETA-ADRENERGIC BLOCKER IN PREVENTING FALLS IN ARTERI 
     cantly decreasing the BRONCHODILATOR action of isoprenaline. 
     gent selective to the HEART, prevented the decrease in ARTERIAL O 
                     4-(2- HYDROXY-3-isopropylaminopropoxy) acetanilide ( 
     1 mg/subject, aerosol INHALATION) treatment without significantly 
     IC patients following ISOPRENALINE (0.1 mg/subject, aerosol inhal 
     GEN TENSION FOLLOWING ISOPRENALINE IN ASTHMATIC SUBJECTS. 
     NCHODILATOR action of ISOPRENALINE. 
           4-(2-Hydroxy-3- ISOPROPYLAMINOPROPOXY) acetanilide (practo 
    propoxy) acetanilide ( PRACTOLOL) (20 mg/subject, i.v.), a β-ADRE 
     LS IN ARTERIAL OXYGEN TENSION FOLLOWING ISOPRENALINE IN AST 
     se in ARTERIAL OXYGEN TENSION IN 11 ASTHMATIC patients following 

g_g: Author entries 

     Diament ML, 820 
     Hamilton WFD, 820 
     Legge JS, 820 
     Palmer KNV, 820 

Figure 11.1 (cont.): Example Indexes. 

INDEXING SYSTEM: Input, Black Box, Output 

Figure 11.2: The Indexing System. 

Procedure: A procedure is a turing machine.* 

Attribute: An attribute, A, is a subset of the set of 
           all possible observables associated with a 
           procedure. A ⊆ Γ (cf. Def. 2.2). 

With the addition of Definitions 2.3 and 2.4 (unit of measure and precision) 
the following definition of measurement can be presented. 

Measurement: Measurement is a procedure which: 
             a) isolates the attribute; 
             b) applies a unit of measure to the 
                attribute; and 
             c) specifies a precision. 

We shall define the result of measurement as data. Clearly, the data 
obtained must be a subset of the set of symbols, Γ, associated with the 
turing machine that embodies the attribute under observation. Following 
Definition 2.5, then, a data element is the smallest datum in the class 
of data arising from the repeated measurement of an attribute. Accordingly, 
a data element is the smallest datum in a well-ordered (cf. Def. 2.8) 
set of data and serves as the differentia for class membership. 

* Notice that the term theory can not be defined as a set of definitions 
  associated with a procedure. 

From the derivation of the definition of data element, it should be 
clear that a data element can be any desired entity. The essential point 
is that the specification of a data element must be accompanied by the 
description of the associated measurement. This means that when referring 
to a data element, one also refers to the name of the attribute measured, 
the unit of measure and the precision of measurement. Any omission yields 
a meaningless entity. For example, the statement "the building is 21" is 
a meaningless statement since "21" is not defined. The building could 
just as easily be 21 years old or it might be 21 stories tall. On the 
other hand, to say that a word is a data element demands a meaningful 
specification such as: "a string of characters delimited by blanks." The 
attribute is a string of characters, the unit of measure is a non-zero 
distance between blanks and the precision of measurement is the recognition 
of a character. Similarly, the data element character might be specified as 
"a unique and unambiguous pattern of bits of length six". We could continue 
to cite examples, but it should now be clear that an infinite variety of 
data elements could be identified. Fortunately, the set of data elements 
which must be dealt with is finite since the measuring (recognition) devices 
associated with a given indexing or retrieval system have a finite 
(manageable) number of outputs. 
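
The requirement that a data element carry its measurement description can 
be sketched as a small record type. The class DataElement, the function 
words and the sample text are hypothetical and serve only to illustrate 
the word specification quoted above; the counts of tokens and types mirror 
the two figures reported with the example document in Figure 11.1. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    value: str       # the datum itself
    attribute: str   # name of the attribute measured
    unit: str        # unit of measure
    precision: str   # precision of measurement


def words(text):
    """Treat each blank-delimited string of characters as one data element."""
    return [
        DataElement(w, "string of characters",
                    "non-zero distance between blanks", "character")
        for w in text.split()
    ]


elements = words("falls in arterial oxygen tension in asthmatic subjects")
tokens = len(elements)                      # every occurrence counts
types = len({e.value for e in elements})    # distinct values only
print(tokens, types)
```

A bare value such as "21" would be rejected by this record type until an 
attribute, a unit and a precision were supplied with it, which is the 
point of the building example. 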

The possibility of a data element being any desired entity is a 
convenience both from a descriptive and a theoretical viewpoint. On 
the one hand it becomes possible to identify a continuum of data elements 
which includes characters, words, strings of words, titles, sentences, 
abstracts, full documents, numbers, frames of film, varying lengths of 
video and audio tape, to name but a few. On the other hand, the 
specification of an indexing method or system serves to define those data 
elements that can be recognized and subsequently processed by the system. 
The obverse of this statement is also valid: the specification of a data 
element defines those systems capable of representing it. Consider the 
following as an example of the defining role of data element. For most 
automatic classification and indexing systems, the data element is defined 
as a word (keyword); however, if the data element is re-defined to be a 
character then a new theory of the classification process is obtained: 
error detection in spelling. 

An understanding of the concept of data element (its definition is an 
extension from present-day, computer-orientated usage) leads one quickly 
into the concepts of document, document space, index, index space and 
query. These concepts will be treated in detail in subsequent sections, 
but an introduction to their importance is provided as follows. 

A document is viewed, by this author, as a well-ordered set of data 
elements. Although a data element can, as we have seen, be any desired 
entity, it is frequently associated with the definition of word, clause 
and/or sentence. These are the units of written-document communication. 

Data elements of this type are well ordered by their physical position 
or occurrence within the document. In addition, some subsets of the data 
elements of a document are well ordered with respect to membership in a 
classification hierarchy, where the ordering relationship is denoted by 
genus-species. Section 12 contains a more detailed discussion of the 
significance (and utility) of data-element relations. 

The input documents to the indexing system (and to the indexing process) 
are represented by the document space. These documents are ordered by their 
time of arrival at the indexing system, or, possibly, by subject content 
(a shared data element and/or relation). It should be clear that some 
subsets of documents in the document space can be well ordered. Furthermore, 
an important feature of the document space, as a collection of documents, 
is that some documents will contain identical data elements and data-element 
relationships. The recognition of document similarity is one of the
essential functions of the indexing process; this process will be described
in Section 13.

The index is the output from the indexing process and it is postulated
to arise as a function of both the document and of the indexing process.
Also, both the index and the description of the indexing process
characterize the operation of the indexing system. We will just mention, at
this point, that the indexing system is further characterized by a
description of its index space, which is itself a representation of those
data elements and relations that can be recognized by the system.

Hence, Theorem 2.1 (Proof): A representation of permitted data elements
and relations is itself a well-ordered set of data elements, hence a
document. The document space is viewed as an R-set (following Russell's
notation [8]; that is, it is a set that contains its own description),
hence its description is a member of the document space.

It should be obvious, following the remarks of the previous paragraph, that
if the index is the result of successive order-preserving mappings
performed on the document space, then the data elements contained in the
index must preserve the original data-element/relation structure of the
document space.

Hence, Theorem 2.2 (Proof): An index, by definition, must preserve the
well ordering of its parent documents and the ordering of the document
space. Thus, an index is a well-ordered set of data elements (trivially
well-ordered by alphabetization).

It follows therefore that the purpose both of the indexing process and
of the creation of the index centers on the representation and subsequent
retrieval of documents. In fact, the fundamental assumption of this section
is that accurate retrieval depends on the exactness of document
representation in the indexing process. Furthermore, it is intuitively
reasonable to assume that a document should have the same representation
both in storage and in retrieval. This observation is thus the basis of the
definition of a query as a well-ordered set of data elements which is a
proper subset of the index.

Hence, Theorem 2.3 (Proof): If a query is a well-ordered set of data
elements then it is, by definition, a document.

12. Communication and Indexing -- II 
12.1 Communication 

Information Science is endowed with a multitude of models and definitions 
of communication. The various views of communication can be conveniently 
classed, following Weaver [9], as describing either technical, semantic or 
pragmatic information transfer. While such models are useful in the
description of specialized modes of communication, we have chosen to
introduce the communication function of indexing in terms of a more
generalized view of communication. Cherry's [10] definition embodies such
a general view:

Communication. Broadly, the establishment of a social unit from
individuals, by the use of language or signs. The sharing of
common sets of rules, for various goal-seeking activities.

The really important point brought to light by this definition is that
communication involves the sharing of behavioral elements. From a
sociological point of view, the sharing of behavioral elements leads to
shared agreement, common understanding* and, finally, concerted action
between the communicants. Furthermore, the concept of sharing is implied
in the view that communication is the relationship between the transmission
of stimuli and the evocation of responses [11].

* The term common understanding is due to Garfinkel [12] and is further
explicated in Landry, Meara, Pepinsky, Rush and Young [13].

We have chosen to model the mechanism (or everyday setting) which
permits the sharing of behavioral elements by a closed system consisting of
an effector (the source of the stimulus), a receptor (the recipient of the
stimulus and the source of the response), a transmission channel and a
feedback unit (see Figure 3.1). It is postulated that the items which are
transferred or "communicated" are data elements and associated relations.

An interesting confirmation of the above view of communication comes from
Peirce's philosophy of Pragmatism. In the late nineteenth century, Peirce
[14] posited the triadic nature of every sign situation. The triads were
designated as "sign-designatum-user" or
"sign-that-which-is-referred-to-user" and embodied the view that
communication involves the expression of the intent of the sign.
Consequently, a sign (i.e., a collection of symbols) never stands in
isolation, but must possess a relationship to other signs. The
acknowledgement of the "understanding" of the intent of the sign
relationship comes from the feedback elicited from the original receiver.
In Peirce's terms, this is the development of the sign. As a final
observation, the definition of data element (Def. 2.5) is a triad involving
the specification of the relationship between an attribute, a unit of
measure and a precision. Thus, viewed as triads, data elements are the
correct elements of communication, at least in the sense of Peirce.

12.2 Experience Set 

The elements of communication are data elements. To go one step
further it is assumed that the "memory" of the source and the receiver
can be adequately modeled as an ordered set of data elements and relations.
We shall refer to the ordered set as an experience set. Hence,

Theorem 3.1 (Proof): The data elements and relations of the experience
set are well-ordered with respect to their order of insertion into the
"memory structure". Hence, by definition, the experience set is a document.

Data elements and relations are selected from the source's experience set
for transmission and, upon reception, these same data elements and relations
are evaluated in terms of the receiver's experience set. The data elements
and relations selected for transmission constitute what I have called the
interface experience set (IES). In a sociological sense, the elements of
the IES are the participant's "informative displays"*. As we have defined
communication, it can involve transmission between any combination of men
and machines. For example, in communication between a programmer and a
computer the IES is some programming language (see Figure 10.1).

An interesting alternate model of the experience set is provided by
Mackay [15]. He argues that the pertinent states and relations (we call
these data elements and relations) are represented by a conditional
probability matrix (CPM). The transition probabilities of the CPM indicate
those relations recognized by the particular experience set. To Mackay, the
meaning of a communicated data element can only be evaluated in terms of
a change of state (or probability) in the CPM. Consequently, the source
decides whether the receiver has properly interpreted the "meaning" of the
communication by carefully observing its effect (i.e., the response). In
this way the source (and, in return, the receiver) draws inferences about
the CPM modification. The concept of a conditional probability matrix will
be further developed in Chapter 5 by means of the construct called a
hypothesis structure.
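
A conditional probability matrix of the kind described above can be sketched as a table of transition probabilities between data elements. The estimation rule used here (relative frequency of adjacent pairs in an observed sequence) and the name `conditional_probability_matrix` are assumptions made for illustration; the text does not say how the matrix is obtained.

```python
from collections import Counter, defaultdict

def conditional_probability_matrix(sequence):
    """Estimate P(next element | current element) from adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        pair_counts[current][following] += 1
    return {
        current: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for current, counts in pair_counts.items()
    }

# Hypothetical stream of received data elements.
cpm = conditional_probability_matrix(["a", "b", "a", "c", "a", "b"])
```

Under this reading, receiving a further element changes the pair counts and therefore the probabilities, which is one way to picture a communicated data element producing a change of state in the CPM.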




"informative display." 



The minimum condition for effective communication is that there be some
overlap between the participants' experience sets. Overlap will permit an
informative display signaled by one of the participants to be properly
interpreted (and acted upon) by the other participant. Of course, the
reliability of the interpretation depends on the degree of commonality
between the respective experience sets and the IES. Hence,
Theorem 3.2 (Proof): Effective communication is maximal when there is
commonality between the ES's and the IES for all possible messages, that
is, when (ES)_source = (ES)_receiver.



and,

Theorem 3.3 (Proof): The IS&R-system user cannot know precisely which data
elements are stored in the system; however, he must understand how the
system stores, organizes and represents documents. Effective communication
with the system is achieved when the system IES relations and
transformations are known and understood by the user.
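
The notion of overlap between experience sets can be made concrete with a small sketch. Two simplifications are assumed here and are not taken from the text: experience sets are treated as plain unordered sets of data elements, and commonality is measured by the Jaccard ratio (shared elements over all elements). The function name `overlap` and the sample sets are hypothetical.

```python
def overlap(es_a, es_b):
    """Jaccard overlap of two experience sets; 1.0 means they coincide."""
    es_a, es_b = set(es_a), set(es_b)
    if not es_a and not es_b:
        return 1.0
    return len(es_a & es_b) / len(es_a | es_b)

source = {"index", "document", "query", "thesaurus"}     # hypothetical
receiver = {"index", "document", "query", "catalog"}     # hypothetical

partial = overlap(source, receiver)   # some commonality
maximal = overlap(source, source)     # identical experience sets
```

A value of zero corresponds to failing the minimum condition for effective communication, and a value of one corresponds to the maximal case in which the experience sets coincide.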



12.3 Transmission Analysis and Indexing 

To this point we have been concerned with a generalized presentation
of the concept of communication. Although somewhat esoteric in nature, such
a discussion provides a theoretical basis for a consideration of the
essential role of transmission representation and indexing. Namely, we
postulate that any theory or practice of communication which causes a loss
of data elements, either through their misrepresentation or by restricting
their flow, must be considered inadequate.

Accordingly, the problem becomes one of representing messages that
come from a number of unrelated sources. The initial collection of these
messages forms what we have labeled the document space. It should be
obvious that message (document) representation must be effected so as to
guarantee the maximum degree of overlap between the experience set that
is the representation and the experience sets of the class of potential
receivers (searchers). In this way, the "meaning" or intent of the document
will be preserved.

The particular document representation that is employed must serve
two distinct functions: 1) it must allow for the creation of "stores" of
document content, and 2) it must provide a basis for search operations.
This type of representational activity is implicit in Graziano's [17] view
of the process of documentation (information storage and retrieval):

...the operational methods of identifying elements, distinguishing
elements from each other and for transmitting sets of patterns
from one time and/or place to another in such a way so as not to
destroy the power of the symbols to convey exact concepts.

The IS&R representational activity, as described above, must be concerned
not only with what the document says (i.e., the message proper) but must
also be concerned with what the document is about (i.e., content analysis).
Since IS&R systems store data and retrieve information, it is the purpose
of the system to effect (permit) the transformations between data and
information. Obviously, the ability to effect these transformations depends
on the fidelity of the representation.

Following Fairthorne [18] it is believed that the maximal representation
of a document depends on the number of distinct configurations* that can be
observed in it. Transformations are employed to reduce the redundancy of
the data elements in the document. (This does not imply that the code used
to represent the data content will be shorter; indeed, the shorter the code
the less structural information is preserved.) These transformations
involve an order-preserving representation of the document. Included are
representation by compression (i.e., the preparation of abstracts and
extracts) and representation by symbolic substitution (i.e., the creation
of index entries through the use of thesauri, word control lists, etc.).
The main observation, at this point, is that indexing performs a
communication function. We postulate that the index provides the IES and
the transformations required for effective communication. We must now
consider the nature of indexing, the inherent drawbacks of present day
real-world indexing, and finally, the types of relations and
transformations required to create the IES.

* This author believes that there exists a small set of structures (e.g.,
patterns of data elements) in a language. Thus there exists a finite
number of relations, so that the variability one observes in a language
is only achieved through data-element substitution.

Compare these two statements as descriptive of the nature of the 
indexing problem: 

No person who is engaged in the work of extracting information
from printed sources... can fail to be aware of the frustration
constantly presented by knowing that the information exists without
knowing where it exists. [19]

What constitutes a good index? The test is to determine whether 
or not an index will serve as a reliable means for the location, 
with a minimum of effort, of every bit of information (sic) in 
the source covered which, according to the indexing basis, that 
source contains. To meet this test an index must be accurate, 
complete, sufficiently precise in the information supplied, and 
so planned and arranged as to be convenient to use. [20] 

Of course the ideal system would store the complete document, and each
stored document would be searched in response to every request. Since
this is a practical impossibility, a representation of the document is
effected through indexing. Grems and Fisher [21] have provided an
interesting characterization and comparison of the nature of indexing and
retrieval. The essence of their description is presented here in tabular
form:

    Indexing        Retrieval

    objective       subjective
    analysis        synthesis
    impersonal      personal
    algorithmic     heuristic

We will dwell at some length on the nature of these retrieval
characteristics in Chapter 5. In any event, indexing is viewed as an
algorithmic process for producing a document surrogate.

12.4 Indexing Failures

Indexes range in size from a few entries to entire sets of volumes.
However, one should not make the false assumption that the quality of
indexing is commensurate with the number of index entries chosen per
document. Usually error is introduced in the creation of the index
through limitations of the representational vocabulary. Under the
constraint of a controlled vocabulary either the system totally lacks the
power to describe the contents of a document, or, if means are available,
they lack the required precision of description. In either case, most
systems (either manual or automatic) force the user to supply alternate
index entries. This occasions a lengthy index search for the satisfaction
of an information need.*

Mellon [22] cautions:

The searcher must guard against relying too heavily on the
indexes. Too often they merely index titles or words, and
at best they probably never contain entries for all of the
important points covered by the articles.

* The concept of information need is defined and discussed in Chapter 5.

If the user adopts this negative view of the index, then he is left to
his own devices to supplement or to supplant its contents. As Skolnik [23]
reports, chemists frequently supplement available indexes by personal
in-depth card files. In desperation, some researchers adopt the "random
scan technique" of covering indexes and documents in an effort to find
important items that have not been properly indexed (or else have been
totally ignored). A case history which typifies the problem is given in
Appendix A, page 136.

Apparently, in the current process of indexing, a document is viewed
as a collection of a few "important" concepts. (The word important is
placed in quotes because importance as determined by a system is likely
to be considerably different from that determined by a user based on his
experience set.) Once these "important" concepts have been identified,
they are given labels and placed into an ordered list together with similar
concepts from other documents. Ordering is based upon commonality of data
elements with virtually no regard for relations shared by them. For example,
a back-of-the-book index can be viewed as an alphabetically arranged
collection of N ordered pairs of index terms and addresses. These entries
correspond to large sections of text, causing potentially important
information to be lost because of a lack of index terms which refer to
specific data elements within the section. In addition, index entries
rarely refer to all of the occurrences of a data element; rather they
represent the (often implicit) imposition of a gross classification scheme
on them.
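
The view of a back-of-the-book index as an alphabetically arranged collection of N ordered pairs can be sketched directly. The terms, page addresses, and the `lookup` helper below are hypothetical and serve only to illustrate the structure; a lookup returns whatever addresses were recorded, and nothing else in the text is reachable through the index.

```python
import bisect

# N ordered pairs (index term, page address), kept in alphabetical order.
index = sorted([
    ("bond", 12), ("resonance", 88), ("bond", 140), ("orbital", 45),
])

def lookup(index, term):
    """Return every address recorded for `term`, found by binary search."""
    start = bisect.bisect_left(index, (term,))
    hits = []
    for entry_term, address in index[start:]:
        if entry_term != term:
            break
        hits.append(address)
    return hits

bond_pages = lookup(index, "bond")
missing = lookup(index, "valence")   # a term the indexer never entered
```

An empty result for a term that does occur in the text is the failure described in the paragraph above: the data element exists in the document but has no entry pointing to it.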

Consider, for example, that there are 620 pages of text in Pauling's
The Nature of the Chemical Bond [24] and 19 pages of index (both subject
and author), or stated differently, approximately 220,000 words of text
and 2,100 index entries. The assumption that one can retrieve data from
the text using the back-of-the-book index is as tenable as the assumption
that only 2,100 of Pauling's words are of consequence, the remaining
218,000 words serving simply as filler. Yet, when such indexes are
discussed and created, this is the assumption which is made in every case.

In order to show that the phenomenon exemplified by Pauling's book
was general in nature the following experiment was performed. Eleven
texts were selected at random from various fields. A chapter from each
was then selected randomly, and the text types* contained in the chapter
were identified. Those that appeared in the back-of-the-book index
(index types; index entries which referred to the chapter in question) were
then counted and the index-type/text-type ratio was calculated (see
Table 12.1). These ratios cluster around 3 percent. Interestingly, the
number of single entries (non-faceted) accounted for approximately 50
percent of the index entries associated with the chapters in question.
The total index-size/book-size ratio was, on the average, 0.6 percent.

Geballe [25] in a recent review of The McGraw-Hill Encyclopedia of
Science and Technology (containing 120,000 index entries and 15.8 entries/
document) faulted the index for its treatment (or non-treatment) of
synonyms and lack of uniformity in cross-indexing. He concludes [26]:

...no editor used a wide-angle lens. The indexing appears
to have been accomplished in a mechanical fashion; it
suffers from a kind of aimlessness and inattention to
overall considerations.




* A text type is defined as a word of the language. 

TEXT    WORDS (TYPE)  TEXT TYPES    INDEX-TYPE/  SINGLE ENTRY  SIZE OF  SIZE OF  INDEX-SIZE/
SAMPLE  IN TEXT       IN THE INDEX  TEXT-TYPE    TOKENS        INDEX    BOOK     BOOK-SIZE

A       1389           66           0.047        50             826     140,000  0.006
B       1300           42           0.032        20             272      59,600  0.005
C        952           24           0.025        11            1398     144,000  0.009
D       1268          132           0.104        48             801      76,800  0.01
E        694            3           0.004         1             156      45,500  0.003
F       1342           25           0.018        12             194      68,400  0.003
G       1225           96           0.078        14             381      56,700  0.007

Table 12.1: A Study of Text and Index Tokens





Text Sample References

A) H. Borko, Automated Language Processing, John Wiley, 1969
   Chapter 4

B) P.M. Fitts and M.I. Posner, Human Performance, Brooks/Cole, 1967
   Chapter 3

C) P.L. Garvin, Natural Language and the Computer, McGraw-Hill, 1963
   Chapter 11

D) J.R. Sharp, Some Fundamentals of Information Storage and Retrieval,
   London House and Maxwell
   Chapter 4

E) D.A. Bell, Intelligent Machines, Blaisdell Scientific Paperback, 1964
   Chapter 8

F) D. Lefkovitz, File Structures for On-Line Systems, Spartan Books, 1969
   Chapter 4

G) S. Artandi, An Introduction to Computers in Information Science,
   Scarecrow Press, 1968
   Chapter 3

Table 12.1 (cont.): A Study of Text and Index Tokens.
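
The two ratio columns of Table 12.1 can be recomputed from the raw counts. The sketch below uses only the figures printed in the table for samples A through G; the variable names are illustrative. The printed ratios appear in some rows to be truncated rather than rounded, so small last-digit differences are to be expected.

```python
from statistics import mean, median

# (sample, text types in chapter, text types in index, size of index, size of book)
TABLE_12_1 = [
    ("A", 1389, 66, 826, 140_000),
    ("B", 1300, 42, 272, 59_600),
    ("C", 952, 24, 1398, 144_000),
    ("D", 1268, 132, 801, 76_800),
    ("E", 694, 3, 156, 45_500),
    ("F", 1342, 25, 194, 68_400),
    ("G", 1225, 96, 381, 56_700),
]

# Index-type/text-type ratio per sample.
type_ratios = {s: in_index / in_text for s, in_text, in_index, _, _ in TABLE_12_1}
# Index-size/book-size ratio per sample.
size_ratios = {s: idx / book for s, _, _, idx, book in TABLE_12_1}

median_type_ratio = median(type_ratios.values())
mean_size_ratio = mean(size_ratios.values())
```

For these seven samples the median index-type/text-type ratio is close to 3 percent and the mean index-size/book-size ratio is close to 0.6 percent, in line with the figures quoted in the text.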

It is concluded that current indexing practices serve not only to
eliminate many of the concepts in the document, but also to destroy many
of the relationships between the concepts which are selected for the index.

12.5 Representational Relations 

The number and type of documentary relations employed in IS&R
activities tend to reflect our general lack of understanding of the functions
of language. This means that data-element relations are employed only as
aids in the representational and indexing process, and serve to contribute
to the complexity of the many information retrieval languages rather than
to facilitate a searcher's interaction with the IS&R system. The point is
that the identification of data element relations allows for the
specification of a document structure which reflects the homomorphic
representational operations discussed in Section 12.3.

Two broad classes of relations can be identified: semantic and
statistical. Statistical relations are characterized by data element type
and token counts and frequency of occurrence values. We have characterized
the semantic relations by five classes of relations (see Def. 3.7-3.12):
equivalence, generic-specific, part-whole, difference and intensional.
We might add to this list what Levery [27] calls the relation of "nearness"
or data-element proximity. Sometimes this relation takes the form of the
identification of related terms (i.e., concept clustering) and sometimes
it takes the form of the identification of contextual environment.
Unfortunately this relation is at best poorly defined and serves mainly
as a symptom of the linguistic shortcomings mentioned above. The relations
of Definitions 3.7-3.12 are, to use de Saussure's terminology [28], defined
in absentia while the relation of "nearness" is defined (recognized)
in praesentia. Alternatively, we can call the former relations
paradigmatic (the identification of patterns of data elements characterizes
the σ transformation; see Def. 3.13) and the latter syntagmatic. Whether
paradigmatic or syntagmatic relations are employed in the indexing language
is really a function of the current state of knowledge. The goal is total
document content analysis, with respect to the other documents in the
document space, so that the relations may be characterized as Fairthorne
describes them [29]:

Parts of a document are not always about what the entire
document is about, nor is a document usually about the
sum of things it mentions. A document is a unit of
discourse, and its component statements must be considered
in the light of why this unit has been acquired or requested.

It should be clear that the specification of data-element relations
is an order-defining transformation of D. Thus the order-defining
transformation, σ, specifies which d ∈ D are mapped into REL. Hence,

Theorem 3.4 (Proof): The function, σ, creates equivalence classes of data
elements with respect to the relations in REL. Thus, σ partitions D.

Similarly,

Theorem 3.5 (Proof): Documents in the document space are partitioned by
data-element membership in the equivalence classes defined by REL.
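
The partitioning claimed in the two theorems above can be illustrated with a short sketch. It assumes, for illustration only, that the relation in question is an equivalence relation given by a table mapping each data element to a canonical label; the table `CANONICAL`, the sample elements, and the function name `partition` are hypothetical.

```python
from collections import defaultdict

def partition(elements, class_of):
    """Group elements into equivalence classes keyed by class_of(element)."""
    classes = defaultdict(list)
    for element in elements:
        classes[class_of(element)].append(element)
    return dict(classes)

# Hypothetical equivalence relation: synonyms share one canonical label.
CANONICAL = {
    "car": "automobile", "auto": "automobile", "automobile": "automobile",
    "index": "index", "indexes": "index",
}

D = ["car", "index", "auto", "indexes", "automobile"]
classes = partition(D, CANONICAL.get)
```

Every element of `D` lands in exactly one class, which is what it means for the classes to partition the set. Documents can then be grouped in the same way, according to which classes their data elements fall into.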

13. A Further Specification of the Indexing System 

Let us briefly review the material that has been presented in the
previous two sections. In Section 11 we considered the nature of the
concept of data element and touched upon its relation to Information
Storage and Retrieval. An example document together with several forms
of index entries were presented in Figure 11.1. The correlation between
the document and the resulting index entries was modeled by means of the
indexing system. Figure 11.2 equated the indexing system to a black box
that receives documents as inputs and produces indexes as outputs. In
Section 12 we considered the nature of communication and information
transfer. The concepts of data element, experience set and interface
experience set were discussed and the task of the analysis and the
representation of the transmission was delegated to the indexing system.
We equated the indexing process to the interface experience set (see
Figure 10.1) and then considered the nature of current
indexing-communication failures. Finally, potential
intradocument/data-element relations were discussed. Attention is now
directed to the role and the position of the indexing system in the
communication process.

In this section we shall differentiate between indexing, the indexing
process and the indexing system. The object of the indexing process, as
was implied in Section 12, is to provide a structure to represent the
various orders of the data elements in the input documents. These data
elements are usually accepted by the indexing process in the form of natural
language strings. Consequently, for a given data element, the indexing
process must represent the following items:

• the data element itself 

• the surrounding data elements (context) 

• the order of the surrounding data elements (syntax) 

• relations (from REL) to other data elements (semantics) 
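
The four items listed above can be gathered into a single record per data element. The sketch below is one possible layout, not the author's design: the class name `ElementRecord`, the fixed context window, and the sample relation table are all assumptions made for illustration. Syntax is captured only in the weak sense that the context is stored in its original order.

```python
from dataclasses import dataclass, field

@dataclass
class ElementRecord:
    element: str      # the data element itself
    position: int     # where it occurs in the input string
    context: tuple    # surrounding elements, kept in source order
    relations: dict = field(default_factory=dict)  # relation name -> related elements

def represent(tokens, position, window=2, rel=None):
    """Build the record for tokens[position] with a +/- window context."""
    before = tokens[max(0, position - window):position]
    after = tokens[position + 1:position + 1 + window]
    element = tokens[position]
    related = dict((rel or {}).get(element, {}))
    return ElementRecord(element, position, tuple(before + after), related)

tokens = ["the", "index", "represents", "the", "document"]
REL = {"index": {"generic-specific": ["subject index"]}}   # hypothetical entries
record = represent(tokens, 1, window=1, rel=REL)
```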

The function called the indexing process is descriptive of the internal
operation of the indexing system. Documents are input to the indexing
system; this system controls the application of the indexing process which
performs order-preserving transformations to represent the data elements
and relations between data elements found in the input documents; and,
finally, an index is generated as output. To perform these functions,
the indexing system must reside intermediary between the transmission
channel and the receiver. This is illustrated in Figure 4.1 by means of
an adaptation of the Shannon and Weaver communication schema.

A feedback function is included in this adaptation of the Shannon and 
Weaver model in order to depict the view of communication represented by 
Figure 3.1. The index, by means of a citation or accession number, enables 
the receiver to retrieve the source’s document and thus complete the 
communication loop. Notice that the transmission channel, the indexing 
system and the feedback function are all potentially affected by noise. 
These errors represent, respectively, document transmission error (possibly
encoding error), indexing process representation error, and receiver
misinterpretation of the source document. Based on the observations of
Section 12.4 it is postulated that the most significant error results from 
an alteration of data-element order by the indexing process. Thus, the 
error associated with current indexing practices serves to obscure the 
unique organization between data elements in documents. 

The input to the indexing system is characterized as a document stream
consisting of previously unrelated documents. The indexing system processes
this stream in fixed intervals of time, called time slices. We assume that
the time required to process a time slice is significantly less than the
time required to process the entire document. In both manual and automated
systems, the bibliographic citation, introduction, body, tables, figures,
conclusion and references all supply different kinds of data elements and
must, accordingly, be isolated and processed separately. Of course, a
given type (and value) of data element may appear in several locations in
a document; consequently the indexing system must recognize common data
elements and relations both within and between time slices. Hence,

Theorem 4.1 (Proof): In the document stream, document boundaries are just
a type of data element, hence parts of more than one document may appear
in a given time slice. Since the indexing system is able to recognize data
elements and relations both within and between time slices, it can
recognize inter- and intra-document data element relationships.
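
The treatment of document boundaries as ordinary data elements in a sliced stream can be sketched as follows. One simplification is assumed: a time slice is modeled as a fixed number of data elements rather than a fixed interval of time. The marker `BOUNDARY`, the function names, and the sample stream are hypothetical.

```python
BOUNDARY = "<doc>"   # hypothetical boundary marker, itself a data element

def time_slices(stream, size):
    """Cut the stream into consecutive slices of `size` elements."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]

def shared_elements(slices):
    """Return the data elements that occur in more than one slice."""
    seen = {}
    for number, time_slice in enumerate(slices):
        for element in time_slice:
            if element != BOUNDARY:
                seen.setdefault(element, set()).add(number)
    return {element for element, where in seen.items() if len(where) > 1}

stream = [BOUNDARY, "index", "space", BOUNDARY,
          "query", "index", BOUNDARY, "index"]
slices = time_slices(stream, 4)
common = shared_elements(slices)
```

The first slice here contains two boundary markers, so it spans more than one document, and the element recognized in both slices links documents that arrived unrelated.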

In a manual indexing system, an indexer (a component of the indexing system)
is considered excellent (other things being equal) if he cuts across
document boundaries when producing index entries. This is because the
information he needs to make correct decisions about data-element values
and relations is usually not contained in a single document. In order to
cut across document boundaries (that is, to process all data elements and
relations in a time slice), the indexer must make use of, among other things,
the very index he is generating. It is for this reason that adequate
(perhaps we should say intelligent) automated indexing systems have seldom
(if ever) been developed.

The role of indexing (the indexing process and system) is to completely
specify the data elements and relations in the document stream by means of
order-preserving transformations. The document representation provided by
the indexing process is a homomorphic reduction, or many-to-one mapping,
from document stream to index. Hence,

Theorem 4.2 (Proof): The reduction transformations preserve the data
element and relation order of the document stream, hence they are
reversible. Document stream reconstruction is possible up to the
specification of data-element order.
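
One way to see how a many-to-one reduction can still permit reconstruction is a positional index: many tokens map to one entry per type, and each entry keeps the positions at which the type occurred. This is offered as an illustrative sketch under that assumption, not as the transformation the author has in mind; the function names and sample tokens are hypothetical.

```python
from collections import defaultdict

def build_positional_index(tokens):
    """Map each distinct element to the ordered list of its positions."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token].append(position)
    return dict(sorted(index.items()))   # alphabetical, like a printed index

def reconstruct(index):
    """Rebuild the original element order from the recorded positions."""
    placed = {pos: tok for tok, positions in index.items() for pos in positions}
    return [placed[pos] for pos in sorted(placed)]

tokens = ["the", "index", "represents", "the", "document"]
index = build_positional_index(tokens)
restored = reconstruct(index)
```

If the positions were discarded, the mapping from tokens to entries would remain many-to-one but reconstruction would no longer be possible, which is why preservation of order carries the weight of the argument.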

14. The Index as a Bi-Directional Interface

The indexing system provides the transformations and the interface
experience set required for effective communication between the source(s)
and the receiver. This means that data elements and relations between
data elements must be identified and represented by means of
order-preserving transformations. We have assumed, from our theoretical
view of the indexing process, that such transformations completely specify
the content of the documents in the document space. The index is the
end product of all of this activity; namely, the index is the image of
composite order-preserving mappings performed on the document space. The
crucial point to realize is that not only is the index the product, but
that it is all that remains of the original document space. We assume that
the original documents are not directly available to the receiver, hence
the index is the receiver's only point of access to the document
collection. Under such constraints it should be clear that accurate
retrieval depends on the exactness of the indexing. In other words, the
indexing system must produce an index that is a facsimile of the document
space. Hence,

Theorem 5.1 (Proof): Inaccurate or incomplete document space
representation will lead to retrieval error since the index is the
receiver's only point of access to the document collection. The index and
the indexing system are the only intermediaries between the document space
and the receiver, hence reliability of document representation is the
function of the indexing system.*

Reliability is partially achieved through the completeness of the index
entry. Bernier [30] makes this point clear:

There is not so much information in an index entry or vocabulary
terms as in the document or parts of a document that it represents.
Because of the greater context and meaning of an index entry heading
and modification (modifying phrase) than of a term or word,
the complete index entry serves more effectively as a guide to the
information than does a single word or term.

* For further amplification of this point, see Appendix A.

However, the indexing system cannot assume that all (or any) statements 
in a document contain information— indeed, a document is just an author- 
assembled collection of data elements. We infer that the data elements of 
the document become information when they are assimilated or put to use by 
the receiver(s). Consequently, information is defined as "data elements 
of value in decision making" (adapted from Yovits and Ernst [31]). The 
index, and the subsequently retrieved documents, must provide data elements 
at the proper time and in the proper form to be of value in the
decision-making process.

Prior to a consideration of the indexing system transformations and 
the index space, an overview of the concept of information is presented. 
This discussion is not only applicable to this section but is also 
preparatory to the topics to be presented in Sections 15, 16 and 18. 

14.1 Information 

We shall consider two approaches to the definition of information 
which was presented above and in Section 5 of this Chapter. First, 
information is defined from an organizational/operational viewpoint and, 
second, information is defined as an extension of the concepts of Turing
machine and procedure outlined in Section 11.

Briefly, we derive information from the world about us by performing
a set of operations on an object under study. The result of these
operations is a selection of a subset from the set of alternatives that
was available prior to the application of the operations. The operations
are the experiment and the subset of alternatives is the resulting
measurement. This is an informational description of the operational
processes of science. Information is obtained through the reduction of the
number of alternatives available to describe the object under study (the
number of alternatives available before the measurement is the precision of
the measurement). As expressed by Brillouin [32], information is the
logarithm of the ratio of the a priori number of alternative values, A_0,
to the a posteriori number of alternative values, A_1:

I = log (A_0 / A_1)

However faithful this measure is to the statistical-mechanical conceptuali- 
zation of information, it tells us nothing about the quality or usefulness 
of the derived information. In the real world there is, for one thing, a 
non-equality between alternatives; thus, a better way of evaluating 
experimental results is desired. 
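
The logarithmic measure just described can be illustrated numerically. The following is only a minimal sketch: the function name, the choice of base-2 logarithms (bits) and the example counts are assumptions made for illustration and do not come from Brillouin [32].

```python
import math


def information_gain(a_priori: int, a_posteriori: int) -> float:
    """Log (base 2, an assumption) of the ratio of the number of
    alternatives before a measurement to the number remaining after it."""
    if a_priori <= 0 or a_posteriori <= 0:
        raise ValueError("the number of alternatives must be positive")
    if a_posteriori > a_priori:
        raise ValueError("a measurement cannot add alternatives")
    return math.log2(a_priori / a_posteriori)


# Hypothetical experiment: 64 equally admissible alternatives reduced to 4.
bits = information_gain(64, 4)
```

Note that the value depends only on the two counts. It says nothing about which alternatives remain or how useful they are, which is the shortcoming raised in the paragraph above.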

There are at least four different classes of information. They
include: 1) technical or communication-theoretic information (Shannon
[32]); 2) semantic information (Carnap and Bar-Hillel [34]); 3) pragmatic
or effectiveness information (Yovits and Ernst [35]); and 4) inferential
or experimental information (including Shannon [36] as an informational
measure of the mean, Fisher [37] as a measure of the variance, and Kullback
[38] as an informational measure of the confidence in alternate hypotheses
about the value of the mean). We shall direct our attention to a measure
of information that incorporates the concept of the use and the
effectiveness of information. To introduce the measure, the concepts of
course of action, decision making and decision are briefly discussed.

Intuitively, a course of action can be interpreted as a planned
sequence of responses to an anticipated set of stimuli. Thus, a course
of action can be defined as a well-ordered set of stimulus-response pairs
that are directed toward the attainment of a goal. A course of action is
specified by the enumeration of the following: the set of inputs that it
can process, the set of states associated with the processor, and the next
state and output functions. Of course this is the definition of a Turing
machine and, consequently, a course of action can be equated to a procedure.
It is possible that alternative well-ordered sets of responses may exist
for the achievement of the same goal. Thus, under the constraint that
only one course of action may be effected during a prescribed interval
of time, a choice must be made between the alternatives. This choice must
take into account both the present state of the system (the system is
that which executes the course of action) and the present inputs (e.g.,
course of action). The execution of the choice may involve several
sequential inputs and several intermediary outputs; consequently, next state
and next output descriptions must be provided. A definition of choice,
then, amounts to a definition of a Turing machine. We shall call the
process of choosing between alternative courses of action decision making.
Based upon the above characterization, the final output from a 
decision-making procedure is the selection of a course of action for
subsequent execution. This final output is called a decision. The decision
which is output is described by the relation o: K × P, where P
denotes the set of alternative courses of action. We shall demand that
the state transformations associated with this choice result in the
attainment of a final state, hence the decision-making procedure will
halt.* Finally, the input symbols to the decision-making procedure which
lead to a final state and to the choice of a course of action are defined 
as information. Since the input symbols are data to the operation of the
Turing machine, i.e., to the decision-making procedure, information is also
defined as data of value in decision making. It should now be evident 
that information is context sensitive** since those data leading to a final 
state depend on the starting state and the sequence of inputs to the 
decision-making procedure. 
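
The context sensitivity of information can be made concrete with a small sketch. It simplifies the Turing-machine formulation to a finite-state procedure, which is an assumption made here for brevity; the state names, input symbols and courses of action are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical next-state function: (state, input symbol) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("start", "yield_data"): "have_yield",
    ("have_yield", "cost_data"): "choose_A",
    ("start", "cost_data"): "have_cost",
    ("have_cost", "safety_data"): "choose_B",
}

# Final states; each one names the course of action that is selected.
FINAL_STATES: Dict[str, str] = {
    "choose_A": "course of action A",
    "choose_B": "course of action B",
}


def decide(start: str, inputs: List[str]) -> Tuple[Optional[str], List[str]]:
    """Run the procedure and return (decision, data that served as information)."""
    state = start
    used: List[str] = []
    for symbol in inputs:
        if state in FINAL_STATES:
            break  # the procedure has already halted
        next_state = TRANSITIONS.get((state, symbol))
        if next_state is None:
            continue  # this datum has no value in the present context
        state = next_state
        used.append(symbol)
    if state in FINAL_STATES:
        return FINAL_STATES[state], used
    return None, []  # no final state was reached: the data were not information
```

Under these assumptions the same data, `["yield_data", "cost_data"]`, lead to a decision when the procedure begins in `"start"` but to none when it begins in `"have_cost"`, so whether the data count as information depends on the starting state.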

The connection between data, information, course of action and
decision making is conveniently modeled in the Yovits and Ernst [39]
description of the information transfer process (see Figure 14.1). Notice
that the observables that result from the execution of the course of 
action eventually become new data for the information (really data) 
acquisition function in the model. It is believed that this interpretation 
of the information transfer process embodies the desirable measures of 
information use and effectiveness. 

Finally, it should be noted that the indexing system is really a model
for the information acquisition box in Figure 14.1. Data must be carefully
identified and represented so that the particular decision-making context
will define the required information. Fairthorne [40] has similarly
observed that mathematical statements are just data: "the information
that one reads into a numerical result often involves the semantic field
of a particular application."

* Giving us, therefore, an algorithm for effectively making decisions.

** A.D. deGroot [41] has shown that after a short stimulus period, a chess
master can easily reconstruct the chess board arrangement shown to him,
whereas a novice finds the task almost impossible. It is hypothesized
that the Master stores the information about the board in the form of
relations between the pieces, rather than in the form of a complete scan.
The relational context creates a non-equality among the probabilities of
the alternative arrangements, thus, there is no "information overload."

Figure 14.1: The Yovits/Ernst Model of Information Transfer (from [31]).

14.2 Indexing System Transformations 

The indexing system effects two distinct types of transformations
upon the input documents. First, data elements and relations present in
the document must be expressed in terms of the system's data elements and
relations. This transformation is effected through use of the analysis
documents, D_a. Secondly, the system effects a transformation (denoted
here by the letter g) on the data element representation to create the
index entry. Variations in g yield different types of index entries. Thus,
the form of the index is specified by this second transformation. Let us
consider each of these transformations in turn.

The document space, in Section 5, was described as the union of two
subspaces: the input documents and the analysis documents. The input
documents constitute the document stream; D_i is continually changing.
However, D_a is created by the system (or, for the system) and is assumed
to change at a rate which is much less than the rate of flow of documents
in the document stream. Analysis documents take the form of classification
hierarchies, word guides*, vocabularies, lists of formulae, syntactic
classes, etc. In other words, the analysis documents are the embodiment
of the system's representational rules, and amount to the system's
realization of the set of relations, REL. For illustration, consider the
sample document presented in Figure 11.1. The title "Effect of a selective
beta-adrenergic blocker in preventing falls in arterial oxygen tension
following isoprenaline in asthmatic subjects" is input to the indexing system
(let's say that it appears in the first time slice) and a representation of
the title is effected by the indexing system. Figure 14.2 shows the data
element transformations that have been effected by documents in D_a.

* A word guide is the only reasonable extension of the concept of thesaurus.

In this example D_a consists of role indicators (R), generic-specific
relations (G), formula list (E) and a word guide or controlled vocabulary
(T). The indexing system representation of the title might take the form:

"R_19 of a selective beta-adrenergic receptor (beta receptor)
blocking drug (drug) in R_3 in arterial (cardiovascular
system) oxygenation tension (airway resistance) R_50 isoprenaline
C14H22N2O3 in asthma subjects."

It is obvious that this representation is a composite of D_i and D_a,
and since the data-element/relation/data-element triplets of D_a
incorporate relations from REL, the representation can be expressed as
(D_i) • REL (Theorem 5.3).
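
The way analysis documents turn input data elements into data-element/relation/data-element triplets can be sketched as a pair of lookup tables. This is an illustrative assumption, not the system's actual mechanism: the table contents are a small excerpt modeled on Figure 14.2, the direction given to the BT (broader term) relation is our reading of that figure, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical excerpt of the analysis documents.
# USE maps an input term to the system's preferred term.
USE: Dict[str, str] = {
    "blocker": "blocking-drug",
    "falls": "reduction",
    "following": "after",
    "asthmatic": "asthma",
}
# BT maps a system term to its broader (generic) term.
BT: Dict[str, str] = {
    "blocking-drug": "drug",
    "arterial": "cardiovascular system",
}


def represent(terms: List[str]) -> List[Tuple[str, str, str]]:
    """Return (data element, relation, data element) triplets for the terms."""
    triplets: List[Tuple[str, str, str]] = []
    for term in terms:
        preferred = USE.get(term, term)
        if preferred != term:
            triplets.append((term, "USE", preferred))
        if preferred in BT:
            triplets.append((preferred, "BT", BT[preferred]))
    return triplets
```

For example, `represent(["blocker", "arterial"])` yields one USE triplet and two BT triplets; terms with no entry in either table pass through without producing a triplet.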

Once the input document representation is effected (i.e., after
several input-time-slice operations), then the indexing system applies
the index-entry generation function, g. Following the model shown in
Figure 11.1, the particular form of the index entry depends upon which
data elements and relations are selected from the representation (see
the index entry examples in Figure 11.1). Hence,



Theorem 5.4 (Proof): The index entry function, by definition,
effects a transformation on the representation
(D_a ⊗ D_i) and this transformation
is an order and relation preserving
(homomorphic) mapping. This transformation is
represented by I = g(D_a ⊗ D_i).
Text:

Effect of a selective beta-adrenergic blocker in
preventing falls in arterial oxygen tension following
isoprenaline in asthmatic subjects.

(Input)                    (System Representation)
Data Element               Data Element

Effect                     R_19                       Role
beta-adrenergic            beta-receptor              BT
adrenergic                 adrenergic-receptor        USE
blocker                    blocking-drug              USE
                           drug                       BT
prevention                                            Role
falls                      reduction                  USE
reduction                  R_3                        Role
arterial                   cardiovascular system      BT
arterial oxygen            arterial oxygenation       USE
arterial oxygen tension    airway resistance          RT
following                  after                      USE
after                      R_50                       Role
isoprenaline               C14H22N2O3                 FORMULA
asthmatic                  asthma                     USE

Figure 14.2: Data Element Transformations Effected by
the Application of D_a.
Once the individual index entries (Def. 5.4) have been generated, then
the physical form of the index depends only on the particular manner of
index-entry ordering, source-document citation, repeated heading selection,
etc. Also a new index is just an update of the old version, hence:

Theorem 5.5 (Proof): New index entries are created by the
application of the indexing process to
new documents in D_i. Assuming that
the index entry function, g, has not
changed, then the new index is simply
the union of the old index and the new
entries.
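
The update described in Theorem 5.5 can be sketched as a set union. The particular entry function used below (an alphabetically ordered tuple of distinct data elements) is purely a stand-in chosen for illustration; the text leaves the form of g open, and the names here are hypothetical.

```python
from typing import Callable, Iterable, Set, Tuple

Entry = Tuple[str, ...]


def g(representation: Iterable[str]) -> Entry:
    """Stand-in entry function: the distinct elements in alphabetical order."""
    return tuple(sorted(set(representation)))


def update_index(
    old_index: Set[Entry],
    new_representations: Iterable[Iterable[str]],
    entry_function: Callable[[Iterable[str]], Entry] = g,
) -> Set[Entry]:
    """New index = old index united with the entries for the new documents."""
    new_entries = {entry_function(rep) for rep in new_representations}
    return old_index | new_entries


old = {g(["asthma", "drug"])}
new = update_index(old, [["oxygen", "asthma"], ["drug", "asthma"]])
```

Because the same `entry_function` is applied to old and new documents alike, a new document whose representation duplicates an existing entry adds nothing, and every old entry survives the update.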

14.3 The Index Space and Retrieval

The end product of the indexing process is the creation of the index
entry and, finally, the index. Lancaster [42] provides a description of
the "ideal" index entry vocabulary:

Ideally, an entry vocabulary should contain all words and phrases
used in input documents to express items of subject matter that
have been recognized in the conceptual analysis stage of indexing.
The entry vocabulary will refer to the code terms used to express
this subject matter.

However, our discussion of the indexing system has given no clue as to 
whether any specific operating system can provide such an entry vocabulary. 
Clearly, a means of characterizing the operating level of the indexing 
system is required. We shall describe the operation of an indexing system 
by means of the index space, ℐ.*

* Maron [43] has used the term index space in reference to an
n-dimensional space of vocabulary terms, where connections represent
shared relationships. This is analogous to the document/term space
described in Chapter 3. We shall, rather, restrict the concept of
an index space to a 3-space.

In Definition 2.11, the index space was initially described as a
representation of the data elements and relations found in the indexing
system. The g-transformation and the D_a-based transformations that were
discussed in Section 14.2 can, obviously, accommodate a wide range of data
element descriptions and relations. In fact, the variability of the
vocabulary and of expressions that characterize the spoken and written
languages applies equally to the operation of the indexing system.
Consequently, we postulate that the indexing system is best characterized
by the vocabulary, productions and expressions that it can recognize and
subsequently represent. Accordingly, the index space is defined as a
triple formed by the cross product of vocabulary, transmission decoding
and language.

The vocabulary, V, is a finite set of possible data elements each of
which defines an equivalence class of symbols. These data elements can
be ordered by precision of measurement. Examples of the resulting data
element continuum are found in Bernier's classes of "microsemantics" and
"macrosemantics": punctuation, symbols, suffixes, words, phrases, clauses,
sentences, paragraphs, pages, chapters, sections, reports, books,
collections... The actual vocabulary elements will be specific characters,
words, suffixes, etc., frequently given as a document in D_a. Subsets
V_i ⊆ V of the vocabulary continuum (not necessarily contiguous subsets)
describe those data elements that can be recognized by the system.

Transmission decoding, TD, is a set of productions (rules) which define
"recognizable" strings of data elements taken from the subsets, V_i. For
example, <word><word> or <word><formula> would be considered as acceptable
input or output strings; however, <formula><formula> might be labeled as
unacceptable. Subsets TD_i ⊆ TD of this continuum of possible data element
productions describe the productions actually employed by a given indexing
system. These productions are especially useful for the characterization
of the permitted data-element syntax in an index entry.
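
A set of productions of this kind can be sketched as a table of permitted type sequences. The sketch below is an assumption-laden illustration: the contents of the production set follow the example in the text, but the element classifier (any token containing a digit is treated as a formula) is a crude device invented here, as are all the names.

```python
from typing import List, Set, Tuple

# Hypothetical TD_i: the type sequences that one indexing system accepts.
ALLOWED_PRODUCTIONS: Set[Tuple[str, ...]] = {
    ("word", "word"),
    ("word", "formula"),
}


def element_type(element: str) -> str:
    """Crude classifier (an assumption): a token with a digit is a formula."""
    return "formula" if any(ch.isdigit() for ch in element) else "word"


def is_recognizable(string: List[str]) -> bool:
    """True when the sequence of element types matches an allowed production."""
    return tuple(element_type(e) for e in string) in ALLOWED_PRODUCTIONS
```

With this table, `["benzene", "C6H6"]` is accepted as <word><formula>, while `["C6H6", "C5H12"]` is rejected as <formula><formula>.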

Language, L, is a set of possible index entry expressions. These
expressions are built from strings over V, defined by TD_i, and from relations
from REL. An example of the continuum of expressions is offered by Meadow's
[44] continuum of indexing languages: hierarchical, subject heading,
keyword, tagged descriptor, faceted term, phrases, natural language. Each
of these languages defines a different form of index entry, especially
when an index entry is viewed as an ordered set of data elements and
relations (see Def. 5.4).

As we shall see, the concept of the index space provides a useful 
framework for analysing the retrieval process. Recall that the concept of 
query, Q, was defined (Def. 2.13) as a well-ordered set of data elements, 
such that Q ⊂ I. From the previous discussion it should be clear that
either this is an idealized statement or else Q represents the query as 
finally accepted by the system.* Experience tells us that the latter is 
the case. A receiver's initial query will not immediately be acceptable 
to the retrieval system — indeed, the problem amounts to one of matching 
the user’s "conceptual" terms with the system’s fixed scheme of document 
representation, that is, of putting the query into a form acceptable to 
the system. 

* We assume that the indexing system provides for both the
representation and the retrieval of documents; thus it can be called the
"retrieval" system.

A query may deal with either specific data elements or complex
combinations of data elements and relations. Typically a query takes the
form "What are the physical and chemical properties of compound X?" or
"How does one convert compound X into compound Y?" The initial goal of
any system/query interaction is to bring the two vocabularies into
coincidence, which means that common data elements and relations must be
discovered (the process of discovery in retrieval is the main topic of
Chapter 5). The user will not necessarily know that the index lists
Pentylbenzene under "Benzene, pentyl-" or that Hexylbenzenes are listed
under "Hexane, phenyl-" [45]; consequently, several interactions with the
index are required before user and system achieve coincidence of expression.
To the designers of the indexing system the disposition of the index space
(e.g., which V_i, TD_i and L are implemented) is clear; however, to the user,
the exact nature of the index (his conceptualization of the index space)
appears to be fuzzy. This situation accounts for the several interactions
required to bring the user's query expression into a form compatible with
the index and the indexing system. Hence,

Theorem 5.6 (Proof): Compatibility between query and indexing
system means that the query and index share
the same vocabulary, productions and
expressions, or, Q ⊂ ℐ.
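
The compatibility condition can be sketched as a three-part membership test against the index space. The record layout, the field names and the example values below are all hypothetical; the sketch assumes only that a query is compatible when its data elements, its production and its form of expression are each among those the system implements.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class IndexSpace:
    vocabulary: FrozenSet[str]               # the implemented subset of V
    productions: FrozenSet[Tuple[str, ...]]  # the implemented subset of TD
    languages: FrozenSet[str]                # the implemented forms from L


@dataclass(frozen=True)
class Query:
    elements: Tuple[str, ...]
    production: Tuple[str, ...]
    language: str


def is_compatible(query: Query, space: IndexSpace) -> bool:
    """True when the query uses only what the index space provides."""
    return (
        set(query.elements) <= space.vocabulary
        and query.production in space.productions
        and query.language in space.languages
    )


space = IndexSpace(
    vocabulary=frozenset({"benzene", "pentyl"}),
    productions=frozenset({("word", "word")}),
    languages=frozenset({"subject heading"}),
)
first_try = Query(("pentylbenzene",), ("word",), "natural language")
reworked = Query(("benzene", "pentyl"), ("word", "word"), "subject heading")
```

Here `first_try` fails on all three counts, while `reworked`, which mirrors the inverted heading "Benzene, pentyl-" mentioned earlier, is compatible. The several interactions a user needs correspond to moving from the first form to the second.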

Theorem 5.6 is interpreted to mean that the data-element ordering and 
relations present in the query must also be present in the index. 
Consequently, retrieval is viewed as a homomorphic mapping from the request 
into the index space. Hence,

Theorem 5.7 (Proof): From Theorem 5.6 and Definition 5.9, we
know that the relations present in the
query must be the same as those in the
index and those which are defined by the
index space. The indexing process and
its reversibility (Thms . 4.1 and 
account for the mapping I • 
15. The Indexing System as a Phase Space

In the previous discussion we have assumed that the indexing system 
is always present and functioning, but for the sake of contrast, consider 
the extreme case of an IS&R system without an indexing subsystem. In such
a system, input documents would simply be stored by their order of arrival
at the system. A user of such a system would be forced to conduct an
exhaustive, sequential scan of the entire collection in response to his
every "information need." Such a system would either have a small (drawer
sized) collection of documents, or a fully automated time independent
processor or, more likely, a vanishingly small group of users. This
hypothetical situation represents the case where data elements are to be
located in a collection about which there is no prior knowledge concerning
its contents. We have previously postulated that effective IS&R
system operation presupposes some manner of organizational scheme for 
document representation. Fairthorne [46] reminds us that the needed 
organizational scheme is not simply a communication-engineering problem: 

The communication engineer is not concerned with completed messages,
but how to deal with bits of them in the course of communication.
The IS&R specialist [rather] deals with spatial collections of
completed messages and, after recognition and identification,
questions of their ordering and disordering predominate.

The IS&R system must effect some form of organization of the input documents
so as to maintain "coverage" and to provide a manageable search time. We
postulate that this organization of the document space is provided by the
indexing system by means of its recognition and representation of inter-
and intra-document data-element relations. Consequently, the probability
of a given set of data elements becoming information (recall the discussion
in Section 14.1) is a function of the work expended by the indexing system.
Symbolically, the potential utility of data elements (with respect to their
information content) after indexing, PU_i, is the sum of the potential utility
before indexing, PU_{i-1}, and the work expended by the indexing process, W:

PU_i = PU_{i-1} + W

We shall now consider how the representational operations of the indexing
system can be modeled by thermodynamic concepts and how such considerations
introduce the concept of "information benefit." First, a brief overview of
thermodynamics is presented.*

Thermodynamics is concerned with the energy description of well defined
systems. More specifically, thermodynamics is the study of the relationship
between heat and work. In the characterization of the indexing system we
shall be concerned with either open systems (systems that exchange heat and
matter with their environment) or adiabatic systems (no exchange with the
environment). Thermodynamic parameters include the following: entropy, mass,
energy, volume, temperature, and pressure. The specification of a value for
the parameters denotes the state of the system. What is important to this
study is that the parametric structure is assumed and the alternative values
(states) are unknown before measurement.

* An excellent discussion of the relationship between energy and information
has very recently appeared [47], to which the reader is referred for a more
detailed treatment of this subject.

Thermodynamic systems are conveniently modeled by statistical mechanics.
Statistical mechanics accounts for thermodynamic properties (microscopic or
macroscopic) by considering a system as a collection of particles (e.g., gas
molecules) subject to the laws of motion. Measurements on thermodynamic
systems are postulated to be performed on a phase space composed of 2n
dimensions (n positional coordinates and n momentum coordinates). The
specification of the state of a particle, or of the state of the system, is
analogous to the identification of a point in this phase space. Carathéodory's
principle [48] posits that certain adiabatic state transformations are
impossible, hence there is a natural partitioning of the phase space. The
resultant partitions are identified as equivalence classes of states
determined by adiabatic transformations. The macroscopic property, entropy, is
assumed to be constant for each equivalence class. Consequently, entropy
measures the amount of missing microscopic information (e.g., which state
is occupied) given the energy of the system [49]. The important point is
that while the structure of the phase space is known, a priori, entropy is a
measure of the uncertainty of the state value.

The indexing system, with respect to the document space, must be treated
as an open system; however, the indexing process is assumed to be effected
within an adiabatic system. The phase space associated with the indexing
system is a space of n dimensions corresponding to the n data elements
recognized by the system. The "configurational" coordinates are those data
elements which characterize documents in D. The "momentum" coordinates, as
we will see later, correspond to the concept of the index search. The
analogue of Carathéodory's principle is the equivalence of data elements as
manifested through shared relationships (from REL) between data elements.
In a real sense, the indexing phase space corresponds to the range of index
entry assignments permitted within the indexing system, hence

Theorem 6.1 (Proof): A point in phase space represents a data
element type, and the partitioning of the
phase space amounts to a specification of
the allowed data-element expressions (state
transformations). Every point of phase
space has its analogue in the system's
index space.
In a formal sense, both indexing and measurement involve the production
of a result from a classificatory act on an object of interest. In IS&R
the object of interest is a document. Indexing, much like measurement,
serves to reduce the uncertainty concerning which data elements are present
in the input document. Clearly, careful observation (measurement) is
required to narrow the a priori alternatives for classification. The result
is the ability of the system to fit a given data element into the index.
However, since most indexing systems are imperfect, the associated
measurement operation must involve some uncertainty. This uncertainty
corresponds to the indexing system's inability to exactly specify the correct
point in phase space. This indexing "noise" or "error" is conveniently
accounted for by the Heisenberg uncertainty principle [50].

Despite the existence of indexing "error", the indexing system as a
phase space effects a considerable reduction in the entropy of the document-
space/searcher interface (Thm. 6.2). Prior to indexing, the searcher's
knowledge of the contents of the document space is minimal; hence, his
uncertainty is maximal. The indexing process identifies those elements of
the phase space which are present in the document space, hence uncertainty,
through the use of the index as the interface, is reduced. However, since
the indexing system is adiabatically closed, such a reduction of entropy must
be matched by a commensurate rise in entropy elsewhere in the system (Thm. 6.3).
This rise in entropy is accounted for by the effort (mental, physical)
required to effect the indexing process. Thus, the change in entropy is
equivalent to the work, W, expended in the indexing process.
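
The reduction in the searcher's uncertainty can be given a rough numerical reading. The sketch below assumes, purely for illustration, that uncertainty is measured as Shannon entropy over which document holds a sought data element and that the candidates are equally likely; neither the measure nor the figures are taken from the text.

```python
import math
from typing import Sequence


def uncertainty_bits(probabilities: Sequence[float]) -> float:
    """Shannon entropy, in bits, of a distribution over candidate documents."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Before indexing: any of 8 documents may equally well hold the element.
before = uncertainty_bits([1 / 8] * 8)
# After indexing: the index narrows the candidates to 2 documents.
after = uncertainty_bits([1 / 2] * 2)
reduction = before - after
```

On this reading the index removes part of the searcher's uncertainty but, being imperfect, need not remove all of it; the residual term plays the role of the indexing "error" discussed above.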

The "momentum" coordinates of the indexing phase space are modeled by
Rothstein [51] as the path of a search. If one adds the dimension of time
to the phase space, then a sequence of points represents a "search" in phase
space. Since we have equated the points of phase space to the entries of the
system's index, then a path in phase space also represents a search through
the index. Rothstein posits the existence of an average scan or search that
is required to retrieve the desired information from the system. The
existence of an average search length (>1) is a direct result of the error
or uncertainty that characterizes the structure of the phase space. We add
the concept of the ideal search, which results in the retrieval of information
on the first access to the index, and the concept of the optimal search
strategy, which results in the shortest path to retrieval, short of the ideal
search. It is to be expected that the retrieved data elements resulting
from such forms of search have varying informational values or benefits
associated with them. Such considerations will be discussed in Sections 16
and 19.
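
The notions of search length and ideal search can be sketched by treating a search as the sequence of index entries a user consults. This representation, the function names and the example paths are assumptions made for illustration only.

```python
from statistics import mean
from typing import List, Sequence


def search_length(path: Sequence[str], target: str) -> int:
    """Number of index accesses made up to and including the first hit."""
    for accesses, entry in enumerate(path, start=1):
        if entry == target:
            return accesses
    raise ValueError("the path never reaches the target entry")


def average_search_length(paths: List[Sequence[str]], target: str) -> float:
    """Mean number of accesses over a collection of observed searches."""
    return float(mean(search_length(p, target) for p in paths))


target = "Benzene, pentyl-"
paths = [
    ["Benzene, pentyl-"],                                    # ideal search
    ["Pentylbenzene", "Benzene, pentyl-"],
    ["Pentylbenzene", "Pentane, phenyl-", "Benzene, pentyl-"],
]
avg = average_search_length(paths, target)
```

An ideal search has length 1. Any uncertainty about how the index expresses an entry pushes the average above 1, as the three hypothetical paths show.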

16. Course of Action as Hypothesis Testing and Decision Making 

The previous sections of this chapter have contained discussions
concerning how the indexing system represents the elements of the document
space. Attention has been directed to both the manner and the form of this
representation. We shall, for the sake of further discussion, assume that the
indexing system has performed its function (the quality of the performance is
another matter) so that we may now consider the nature of the process of
conversion of stored data into information. We will only briefly discuss
the concepts of goal, hypothesis testing and decision making, since a detailed
consideration of their nature and role in information retrieval forms the
substance of Chapter 5.
When a receiver (user) attempts to access the stored data, by means
of the index, it is assumed that he has a goal in mind. We assume that,
to the user, a goal represents a desired end product or end state. Several
examples of retrieval goals can be given in the form of questions:

Are there data on compounds X and Y?
How does one convert X into Y?
What is the mechanism of the reaction of X with Y?
What is the effect of catalyst A on the reaction of X with Y?
What other (than X) compounds yield Y under the influence of A?

One can easily conceive of these goals as representing separate but conceptual- 
ly related sequences of interaction with the retrieval system (cf. footnote on 
p. 107). But for each goal there is a corresponding course of action which 
is executed as a repeated interaction with the index and subsequent analysis 
of retrieved data elements. Furthermore, each interaction involves a 
hypothesis concerning the contents of the document space. Thus, each course 
of action is a sequence both of hypotheses concerning the 
contents of the data store and of decisions concerning whether or not the 
goal has been obtained. Recalling the discussion of Section 14.1, it is 
recognized that the retrieved data may provide information with respect to 
goal attainment. Thus, an initial hypothesis may be either refuted by the 
retrieved data or else it may be incompletely supported; in either event, a
new hypothesis must be formulated and new data examined. 

The progression between hypotheses, decision making and goal achievement,
which was depicted in Figure 7.1, gives rise to two cases of data-element
benefit, with respect to goal attainment. The sequence hypothesis-
formulation/decision/goal is said to provide maximum benefit since the data
retrieved (information), in response to the initial hypothesis, completely
"satisfy"* the goal. Conversely, the minimum benefit case is identified by
the hypothesis-to-hypothesis path. In such a case, the data that are
retrieved are not sufficient to provide information concerning goal
achievement. As was previously implied, benefit intermediate between the
maximum and minimum benefit cases is obtained when the information is
necessary but not sufficient to reach the goal, and the formulation of a new
hypothesis, based on the nature of the data already obtained, is required.
Thus, when the retrieved data are not wholly suited to the testing of the
initial hypothesis, a new hypothesis must be formed. Although a decision must
be made to formulate this hypothesis, we will call this a meta-decision
since it is not directly involved in the final attainment of the goal.
Furthermore, although information is obtained from the failure of an
hypothesis, we shall refer to such information as meta-information since it is
associated with a meta-decision. Data elements are of value in decision making
(hence, are information) when they are directly involved with goal achievement.
Hence, meta-information is not equivalent to information (Thm. 7.1).

It is interesting to note that the types of data-element search that are
carried out in phase space (see Section 15) can be conveniently represented
as the progression between hypotheses. The ideal search corresponds to
retrieval yielding maximum benefit (the H-D-G path), whereas the "average"
search is represented by a chain of hypotheses terminating with goal
attainment (H-D-H-...-D-H-D-G). The optimum search strategy represents the
user's systematic variation of index-entry attributes in an effort to
retrieve the desired information. The nature of such a strategy will be
discussed in Chapter 5.

* The concept of the "satisfaction" of an information need will be discussed
in Chapter 5.

It should be clear from the previous discussion that benefit is a relation- 
ship between the information obtained by the receiver and the number of 
decisions (meta- and real) required to satisfy an "information need". 
Intuitively, the information that is retrieved through the first query and 
that satisfies the goal has maximal benefit. However, we infrequently 
experience the maximal-benefit situation — rather, benefit must be accumulated 
over a sequence of queries. 
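
The three benefit cases described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the path notation (H = hypothesis, D = decision, G = goal) follows the text, but the function names `classify_search_path` and `reformulations`, and the decision to count each hypothesis after the first as one meta-decision, are hypothetical and not part of the theory as stated.

```python
def _steps(path: str) -> list:
    """Split a path such as 'H-D-H-D-G' into steps and validate it."""
    steps = path.split("-")
    if steps[0] != "H" or any(s not in {"H", "D", "G"} for s in steps):
        raise ValueError("path must start with H and use only H, D, G")
    return steps


def classify_search_path(path: str) -> str:
    """Return the benefit case for a search path.

    maximum      : a single hypothesis leads directly to the goal (H-D-G)
    intermediate : several hypotheses are needed before the goal is reached
    minimum      : the path never reaches the goal (hypothesis to hypothesis)
    """
    steps = _steps(path)
    reaches_goal = steps[-1] == "G"
    if reaches_goal and steps.count("H") == 1:
        return "maximum"
    if reaches_goal:
        return "intermediate"
    return "minimum"


def reformulations(path: str) -> int:
    """Count hypotheses formed after the first one (assumed: one meta-decision each)."""
    return _steps(path).count("H") - 1


for p in ("H-D-G", "H-D-H-D-H-D-G", "H-H"):
    print(p, classify_search_path(p), reformulations(p))
```

Under these assumptions, benefit falls as the count returned by `reformulations` rises, which matches the remark that benefit must usually be accumulated over a sequence of queries.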

17. Perfect and Imperfect Indexing Systems 

By a fiction as remarkable as any to be found in law, what has 
once been published (no matter what the language) is usually 
spoken of as known, and it is often forgotten that the rediscovery 
in the library may be a more difficult and uncertain process 
than the first discovery in the laboratory. 

Lord Rayleigh 

This dim view of a searcher's likelihood of success in library search
is further supported by Reid's [52] comments: "... a point will always be
reached, eventually, where all competent judges must agree that the
probability of finding a reference and its possible value if or when found
do not warrant the time, trouble or expense involved in continuing
[searching]." It is emphasized that perhaps the principal postulate of the
theory of indexing propounded in this dissertation is that error in
information storage and retrieval stems from error in indexing. The indexing
process, as usually implemented, does not accurately mirror the contents of
documents in the document space. As a consequence of this failure, a
document indexed, for example, by the term "glass" may actually discuss a
principle governing the action of metals or of undercooled melts [53]. Aside
from search by "browsing", these other "content descriptors" are forever
lost. We need not prolong the examples of indexing failure (see Section 17.4
for a discussion of the failures of the back-of-the-book index); rather, let
us contrast the concepts of "perfect" and "imperfect" indexing systems and
attempt to draw some conclusions concerning areas for the improvement of
current indexing processes.

17.1 The Theoretical and Real-World Indexes 

The "perfect" indexing system operates according to the principles
embodied in the indexing theory; hence, we shall call the output from this
system the theoretical index. We define the theoretical index as serving to
represent all inter- and intra-document relations between data elements in
the document space. It is assumed that order-preserving operations and
transformations are employed at all steps of the "perfect" indexing process.
Recalling Mellon's definition of the good index presented in Section 12.3,
"...an index will serve as a reliable means for the location, with a minimum
of effort, of every bit of information [data] in the source covered," it
is concluded that every data element occurrence must be indexed so that the
contents of the document will be available to every potential user and query.
This is the essential role of the theoretical index.

If we assume that there are potentially d data elements in a given document,
then each data element serves as a two-valued function: either the document
has the datum or it does not. Consequently we could define $2^{d}$ subsets of
data elements by the operation of set intersection. The theoretical index
must provide for the existence of at most $2^{d}$ connections (shared
relationships) between data elements. However, since there are multiple
documents (say m of them) in the document space, one must allow for the
existence of an increased number of data-element relationships. One bound
(limit) on the number of data-element relations is $2^{(d_1 + \dots + d_m)}$
and, by definition, the theoretical index must be able to represent any
subset of $2^{(d_1 + \dots + d_m)}$ (Theorem 8.1). This large number of
relations is calculated under the assumption that the data elements
associated with the m documents are unique. A more manageable upper bound
for the number of potential relations that could appear in the theoretical
index would be $2^{(d_1 \cap d_2 \cap \dots \cap d_m)}$. Since a query, Q, is
mapped into the index, I, we can define a mapping of requests into the
subsets of possible data-element relations. This mapping is between data
elements of the query and entries of the index:

$$Q \rightarrow 2^{(d_1 \cap d_2 \cap \dots \cap d_m)}$$
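
The counting argument above can be checked on a toy case. The sketch below is illustrative only: the three data elements and the document sizes are invented, and the function names `element_subsets` and `unique_element_bound` are hypothetical. It enumerates the $2^{d}$ subsets for one small document and evaluates the bound for m documents under the stated assumption that no data element is shared.

```python
from itertools import combinations


def element_subsets(elements):
    """Return every subset of a document's data elements (2**d of them)."""
    items = sorted(elements)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def unique_element_bound(doc_sizes):
    """2**(d_1 + ... + d_m), assuming no data element is shared between documents."""
    return 2 ** sum(doc_sizes)


# Hypothetical document with d = 3 data elements.
title_elements = {"effect", "blocker", "tension"}
subsets = element_subsets(title_elements)
print(len(subsets))                     # 8 == 2**3
print(unique_element_bound([3, 2, 4]))  # 512 == 2**9
```

Even for these tiny sizes the bound grows quickly, which is the practical point of the comparison with real-world indexes that follows.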



In contrast with the "perfect" indexing system, the "imperfect" indexing
system is characterized by its output, the real-world index. The essential
difference, as we have previously mentioned, is that real-world indexes
contain significantly fewer index entries than would have been represented in
the theoretical index. Consequently, there is a loss both of important data
elements and of significant relationships between data elements. We
postulate that for a given document space, real-world indexes fall short of
the theoretical index because the indexing of the document space is
incomplete.

An example of the incompleteness of "imperfect" indexing systems can be
found in a comparison of the theoretical growth rate of four well-known
indexing methods with their operational counterparts, the growth rates of
which are severely restricted by means of word control lists and simple
index-size limitations. Figure 17.1 shows the theoretical (solid lines)
versus the real-world (dotted lines) growth rates of an articulated index
[54], the SLIC index [55], and the Uniterm or keyword index. The theoretical
growth rates are as follows: articulated - odd members of the Fibonacci
series; SLIC - Uniterm - n, where n is the number of data elements/document.

Clearly, real-world indexes do not provide a sufficient number of index
entries.* Figure 17.2 shows the performance of these indexing methods with
respect to the hypothesized number of relational entries. Values above
the equality-of-number-of-the-index-entries-to-$2^{n}$ line represent
redundant entries, whereas values below the line indicate poor performance.
Interestingly, for large numbers of terms, the simple combinations of terms
show the best performance. However, it can be argued that the SLIC index
performs just as well since all combinations of terms can be easily generated
from this index. One can also argue that a consistent deletion of redundant
entries is desirable.
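
The theoretical growth rates named in this section can be tabulated for a given number of terms n. The sketch below rests on several assumptions that are not stated in the text: "odd members of the Fibonacci series" is read as the odd-indexed terms F(1), F(3), F(5), ..., so that an articulated index of n terms has F(2n-1) entries; "combinations" is read as the $2^{n} - 1$ non-empty subsets of terms; and the SLIC index is left out because its growth expression is not legible here. The function names are hypothetical.

```python
from math import factorial


def fibonacci(k):
    """Return the k-th Fibonacci number, with F(1) = F(2) = 1."""
    a, b = 1, 1
    for _ in range(k - 1):
        a, b = b, a + b
    return a


def theoretical_entries(n):
    """Theoretical number of index entries for n terms, by indexing method."""
    return {
        "permutation": factorial(n),          # n!
        "articulated": fibonacci(2 * n - 1),  # assumed: odd-indexed Fibonacci terms
        "combinations": 2 ** n - 1,           # assumed: non-empty subsets of terms
        "double KWIC": n * (n - 1),
        "Uniterm": n,
    }


for n in (7, 8, 9):
    counts = theoretical_entries(n)
    ranked = sorted(counts, key=counts.get, reverse=True)
    print(n, counts, ranked)
```

Under these readings, the ranking for n > 6 agrees with the ordering by decreasing number of entries given in the footnote to this paragraph, apart from the omitted SLIC index.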

In general, then, published indexes fall short of the theoretical index.
Reasons for this phenomenon could be the lack of adequate technology for
large index storage (we will briefly discuss this point in Chapter 6) or the
prohibitive cost of the generation of a large number of index entries. It
is realized that the theoretical index and the perfect indexing system are
unobtainable**; however, there is a positive value in knowing what the ideal
is, if only for purposes of evaluation of operational systems.

* For n > 6 we have the following ordering by decreasing number of entries:
permutation (n!), articulated, combinations, SLIC, double KWIC (n(n-1))
[56], Uniterm.

** The third law of thermodynamics tells us that the entropy of a pure quantum
state is 0; or, that complete certainty about the document space is impossible.

Figure 17.1: Theoretical vs. Real-World Index Growth
(No. of Index Entries plotted against Number of Terms.
A = Articulated index; S = SLIC index;
U/K = Uniterm or KWIC index; dotted lines = real-world,
solid lines = theoretical)

Figure 17.2: Relationship Between Real-World Indexes
and the Theoretical Index.
(A = Articulated; C = Combinations; S = SLIC
index; U/K = Uniterm or KWIC index;
dotted lines = real-world, solid lines = theoretical)

17.2 Possible Real-World Index Improvements 

Index size is a sensitive subject. For example, the size of a back-of-
the-book index cannot grow, for reasons of economy, to a size equal to or
larger than the size of the book! However, any increase over present-day
sizes would be beneficial in terms of efficient and accurate information
retrieval. The only practical way of increasing the accuracy of a book index
is to increase the number of index entries, and certainly to increase the
number of multi-term entries. This would go a long way toward increasing
both the number of relevant data elements and data-element relations.
Unfortunately, most other forms of indexes suffer from a lack of "depth of
indexing" [57] and would therefore benefit from an increase in the number of
entries. For example, adding subject, classification or text enrichment terms
to a document title will, in many cases, vastly increase its utility in a
retrieval data base, especially when the title is used in a KWIC index.*

The main problem is that depth of indexing is not solely associated with the
number of index entries (i.e., the number of keywords), but relies on the
exactness of the specification of data-element relations. Accurate retrieval
depends on the commonality of data elements and relations. We shall consider
the representation of data elements and relations by a finite state graph
(really a model of the language component of the document space).



* One unexplored possibility is the use of KWIC indexing to represent
bibliographic citations. Indexes could be prepared not only for authors,
but also for title terms, sources and dates. This would eliminate the
tedious scanning of lengthy reference lists.

Figure 17.3 shows a very simple example of two data elements associated
by means of three relations: A, B, C. By construction, all three relations
fall into the same equivalence class (since they all link the two data
elements); however, each relation will serve to identify a different set of
documents. Thus, the product of all relations between data element 1 and
data element 2 will yield all of the documents in some way related to both
data elements.

In this model, new relational equivalence classes can be easily defined,
for example, by specifying a directional nature to the edges of the graph
(cf. the "paradigmatic trees" in SYNTOL [58]). It is noteworthy that the
larger the number of simultaneous relations in the query (i.e., the larger
the query set) the smaller the number of retrieved documents. Also, the
longer the path between any two data elements (the more included data
elements), the fewer the number of documents that will be retrieved.
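
The effect of query size on retrieval can be illustrated with ordinary set operations. This is a sketch under assumptions: the document numbers attached to relations A, B and C are invented, "simultaneous relations in the query" is modeled as set intersection, and "all of the documents in some way related to both data elements" is modeled as the union of the three document sets. The names `RELATIONS`, `all_related` and `retrieve` are hypothetical.

```python
# Hypothetical document sets identified by each relation between
# data element 1 and data element 2.
RELATIONS = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 5},
    "C": {3, 6},
}


def all_related():
    """Documents linked to both data elements by at least one relation."""
    return set().union(*RELATIONS.values())


def retrieve(query):
    """Documents satisfying every relation named in the query simultaneously."""
    sets = [RELATIONS[name] for name in query]
    if not sets:
        return set()
    return set.intersection(*sets)


print(sorted(all_related()))              # [1, 2, 3, 4, 5, 6]
print(sorted(retrieve(["A"])))            # [1, 2, 3, 4]
print(sorted(retrieve(["A", "B"])))       # [2, 3]
print(sorted(retrieve(["A", "B", "C"])))  # [3]
```

Each relation added to the query can only shrink (or leave unchanged) the retrieved set, which is the behavior described in the paragraph above.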

The data-element/relation system (Figure 17.3) is completely defined
(macroscopically) by the initial and final states (data-element 1 and
data-element 2 in this case). However, the entropy of the "relational phase
space" is greatly reduced by the actual identification of the specific
relations A, B, C... This enables the indexing system to precisely define
the relative "position" within the phase space of all the documents in the
collection.

However, such a "structural" representation is not presently available
in indexing systems. Data reaches the indexing system in the form of natural
language strings, whose elements exhibit strong syntactic and semantic
relations. We can infer that the number of such relations is significantly
reduced after indexing, as evidenced by poor retrieval results. Paradigmatic
systems such as links and roles only serve to place the problem on a
different level since they can provide only a limited number of syntactic
or semantic markers for a data element. It is possible that a data element
may not belong precisely to any such classes, and the potential
misrepresentation of the syntax and semantics in a document can lead to false
retrieval. Such difficulties may be avoided by preserving in the index the
syntactic and semantic relations among data elements as given in the original
document. This problem is at least partially resolved by a Case Grammar
analysis of natural language strings.

Figure 17.3: Data Element Relation Structure.
Circles represent terms and lines represent relations.

Case Grammar was first described by Fillmore [59] in 1968 and presupposes
causality and instrumentality in language. It is believed that the role and
function (e.g., "meaning") of words in deep structure is accurately
portrayed by Case Grammar. We shall provide only a terse statement of the
nature of Case Grammar since ample exposition is offered elsewhere [60].

Case Grammar (not to be confused with traditional notions of case) focuses
on the pivotal role of the verb in natural language phrases.* Nouns are
viewed as exhibiting a relationship with the coordinating verb. The
relationships identified by Case Grammar include (and are denoted by the term
case): agent, instrument, object, experiencer, possessive, source, time,
location, manner and degree. The remaining words of the phrase (adjectives,
adverbs, prepositions, etc.) are treated as facets of the case nouns.

The identification of the case grammar relations and the subsequent index
entry generation involve six steps:

1) the input of text (title or sentence)
2) the identification of clauses
3) the identification of the verb (or auxiliary) within clauses
4) assignment of cases
5) facet isolation
6) index entry generation
   a) case index
   b) verb index
   c) facet index

* Our analysis, following Cook [60], treats the clause as the basic
"informational" unit in natural language discourse. "Clause" is defined
as a word grouping containing one and only one predicate [62].

Figure 17.4 shows a case grammar analysis of the title of the example document
introduced in Section 11 of this Chapter. The words of the title are listed
(in order of their occurrence) by case, verb or facet membership. The form
of display is adapted from Cook [61]. Notice that the case and verb entries
give the "essence" of the title:
"effect-preventing-falls-tension-following-isoprenaline-subjects," while the
facets provide the specifics. Figure 17.5 illustrates the index entries that
would be created from this title. It is believed that the entries of the
"Case Grammar Index" preserve the order of discussion and exhibit the
organization of the underlying thought.
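
Step 6 of the procedure (index entry generation) can be sketched for the example title. The sketch assumes that steps 1 through 5 have already been carried out by hand, so the case, verb and facet assignments are simply typed in from the analysis of document 820. The data layout, the `FUNCTION_WORDS` stop list and all function names are hypothetical, and the entry format is simplified relative to the published figure.

```python
DOC_ID = 820

# (case noun, case, facets in the order listed in the analysis)
CASES = [
    ("effect", "O", ["blocker", "beta-adrenergic", "selective", "a", "of"]),
    ("falls", "O", []),
    ("tension", "LOC", ["oxygen", "arterial"]),
    ("isoprenaline", "O", []),
    ("subjects", "LOC", ["asthmatic", "in"]),
]

# Each verbal and the case nouns it coordinates.
VERBS = {
    "preventing": ["effect", "falls"],
    "following": ["tension", "isoprenaline"],
}

# Assumed stop list: function words never lead a facet entry.
FUNCTION_WORDS = {"a", "of", "in"}


def case_index():
    return sorted(f"{noun.capitalize()} [{case}] {DOC_ID}"
                  for noun, case, _facets in CASES)


def verb_index():
    return sorted(f"{verb.capitalize()}, {noun} {DOC_ID}"
                  for verb, nouns in VERBS.items() for noun in nouns)


def facet_index():
    """Rotate each facet list so that every content word leads one entry."""
    entries = []
    for noun, _case, facets in CASES:
        for i, lead in enumerate(facets):
            if lead in FUNCTION_WORDS:
                continue
            rotated = facets[i:] + facets[:i]
            entries.append(f"{', '.join(rotated).capitalize()} ({noun}) {DOC_ID}")
    return sorted(entries)


for entry in case_index() + verb_index() + facet_index():
    print(entry)
```

The rotation in `facet_index` reproduces the pattern visible in the facet entries of the example (for instance, "Selective, a, of, blocker, beta-adrenergic (effect)"), though the strict alphabetical ordering used here is a simplification.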

Finally, Figure 17.6 presents a structural representation* of the subject 
title. Content words are represented by capitalized letters and function 
words are represented by lower case letters. Connections in the structure 
represent the logical (relational) dependencies between the words. Notice 
the complete isomorphism between this structure and the corresponding case 
grammar assignments. The nodes with the highest connectivity correspond to 
the case grammar entries and the surrounding nodes correspond to the facets. 



* Based upon that proposed by Rush [65].

Document # 820

N      P  D A         A               N       P  PRT
Effect of a selective beta-adrenergic blocker in preventing

N     P  A        A      N       V         N            P
falls in arterial oxygen tension following isoprenaline in

A         N
asthmatic subjects.

Syntax:  N = noun         D = determiner   PRT = participle
         P = preposition  A = adjective    V = verb

Case grammar:  O = object (receiver of action)
               LOC = location (place, extent, duration)

CASE               VERBAL        FACET

Effect (O)                       blocker, beta-adrenergic, selective, a, of
falls (O)          preventing
tension (LOC)                    oxygen, arterial
isoprenaline (O)   following
subjects (LOC)                   asthmatic, in

Figure 17.4: Case Grammar Analysis of a Title.

Case Index

Effect [O], preventing falls                            820
   , blocker, beta-adrenergic, selective, of,           820
Falls [O], effect preventing                            820
Isoprenaline [O], tension following                     820
Subjects [LOC], asthmatic, in                           820
Tension [LOC], following isoprenaline                   820
   , oxygen, arterial                                   820

Verb Index

Following, tension                                      820
   , isoprenaline                                       820
Preventing, effect                                      820
   , falls                                              820

Facet Index

Asthmatic, in (subjects)                                820
Arterial, oxygen (tension)                              820
Beta-adrenergic, selective, a, of, blocker (effect)     820
Blocker, beta-adrenergic, selective, a, of (effect)     820
Oxygen, arterial (tension)                              820
Selective, a, of, blocker, beta-adrenergic (effect)     820

Figure 17.5: Index Entries Derived from Case Grammar Analysis.

Document # 820

A      a  b B         C    D          E       c  F
Effect of a selective beta-adrenergic blocker in preventing

G     d  H        I      J       K         L            e
falls in arterial oxygen tension following isoprenaline in

M         N
asthmatic subjects.

O = case index entry
* = verb index entry
= facet index entry

Figure 17.6: A Structural Representation of the Title
Showing Isomorphism with the Case Grammar
Analysis.

An indexing system would process and store this structure by reducing it
to a connection-matrix representation (e.g., by means of the Morgan [64]
numbering algorithm) and one form of storage would be to represent the
structure by the sequence of case nouns: A - - G ( - J) - - L - N. Queries
in this system would be effected by means of sub-structure searches.

18. The Index as a Tool of Inquiry

In this Chapter we have presented the basis for a comprehensive theory
of Information Storage and Retrieval. Our thesis has been that this theory
has its genesis in a theory of the indexing process. In other words, it
is believed that the success of an IS&R system depends, primarily, on
accurate and complete document representation, and that such representation
is the goal of any indexing process. It has been contended that the index
provides the necessary linkage between a multiplicity of sources and a
single receiver. Conceptually, the indexing system is initially viewed as
a black box that accepts documents as its inputs and produces the index as
its only output. The various sources produce the documents which become the
elements of the document space and the receiver produces queries which are
matched against the index and, eventually, against the document store. Whether
considering the source/document-space interface or the query/index interface,
the elements of the underlying communication phenomena are the same: data
elements and relations between data elements. Following the progression of
schema presented in Figure 10.1, we first considered the necessary criteria
for effective communication and concluded that the index provided the
requisite common experience set between the source and the receiver. We then,
more precisely, positioned the indexing system as intermediary between the
communication channel and the receiver (searcher) and emphasized the role of
"noise" and feedback. Following a specification of the "position" of the
black box or indexing system, we considered a theory of its operation. This
theory, called the indexing process, defines the essential operation of the
indexing system to be the creation of the representation of the document
space. The analysis-document transformations and the final index-entry
transformations were shown to be, respectively, a prerequisite to, and a
function of, the document-space representation. Adequate examples of these
transformations were provided through an analysis of the example document
introduced in Section 11. Finally, the operating characteristics of the
indexing system were modeled by means of the index space. From a different
point of view, the concepts of error, organization, information and search
were introduced through a consideration of the indexing process as a
thermodynamic system. We could then postulate the existence of the "perfect"
indexing system and the theoretical index as compared with their real-world
counterparts.

We have cast the indexing process as a mechanical, well-defined set of
operations; however, the use of the index data, by the receiver, presents an
altogether different problem. As a consequence, a theory of the indexing
process must also provide the means for the description of the process of
searcher/index interaction. The modeling of the process of interaction is an
admittedly "fuzzy" undertaking, but it is believed that an understanding of
the processes of index creation will provide the basis required for the
analysis of search. Thus in this section we consider briefly the problem of
the searcher-directed conversion between data and information and the concept
of data element "value."

Following the discussion presented in Section 16, we must conclude
that a query represents the searcher's hypothesis about the contents of the
document space. However, only rarely would an initial hypothesis prove to
be satisfactory with respect to the searcher's goal or "information need."
It is believed that either he has an incomplete understanding of the
organization of the system (the nature of the index entry and the indexing
system representation) or he is unable to adequately formulate a hypothesis
about its contents. Thus retrieval or search was modeled as a series of
hypotheses and decisions which eventually end with goal achievement.
Hopefully each interaction with the index leads to more precisely specified
hypotheses and to hypotheses which are commensurate with the structure of
the data base; such hypotheses will have a greater probability of yielding
the desired goal.

We assume that the paths of maximal and minimal retrieval benefit,
described in Section 16, have a small probability of occurrence. Consequently,
most searcher/index interaction is adequately modeled by the "intermediate"
case characterized by the alternation of hypotheses terminating with goal
achievement. Hence,

Theorem 9.1 (Proof): The first data element retrieved, as a
                     consequence of the first interaction with
                     the index, will be of small benefit in goal
                     achievement, since it will only provide
                     meta-information leading to the formulation of a
                     new hypothesis. This is a consequence of
                     the high probability of occurrence of the
                     intermediate case.

In the case of the maximal benefit, or the H - D - G path, we say that the
data element that is retrieved, in response to the single hypothesis, has
maximal utility or value since it immediately satisfies the information need
of the searcher. However, if the same data element is retrieved after a
series of hypotheses then its value, with respect to the information need,
will have decreased or decayed. In the intermediate case, the retrieval of
a given data element is dependent upon the prior sequence of hypotheses and
data elements. Consequently, the decrease in utility of a data element is
directly related to its position in a sequence of retrieved data elements
(Thm. 9.2). The more hypothesis-testing and decision-making steps prior to
the retrieval of a given data element, the smaller its utility, i.e.,
the smaller its information content. Thus, any index/retrieval interaction
longer than one operation sets up an nth order dependency between the nth
retrieved data element and the n-1 ones previously retrieved.

We postulate that the value of a data element, with respect to goal
achievement, is Poisson distributed over time. In Figure 9.1 the designated
time intervals correspond to a succession of alternate hypotheses and
consequent decisions. According to the previous discussion, the greater the
number of time intervals prior to the retrieval of a given data element then
the smaller its utility or value. The downwards sloping curve is a direct
consequence of the nth order dependency between successively retrieved data
elements. The choice of a value for the parameter $\lambda$ in the Poisson
distribution

$$p(v) = \frac{\lambda^{v} e^{-\lambda}}{v!}$$

controls the rate of decrease of value and is assumed to be characteristic of
a given retrieval situation. It is possible that the value of $\lambda$
depends on the experience of the searcher and that a decrease in the slope
(over a sequence of sets of interactions) corresponds to the searcher
learning the attributes that characterize the system and the index.

Assuming that the searcher is following the intermediate-path case then
we postulate that although the value of each successively retrieved data
element decreases, the rate of decay of the utility of newly retrieved data
elements decreases with each new hypothesis. Figure 9.2 shows an envelope
curve, E, which is the Poisson distribution of data element value. At each
time, $t_i$, a new hypothesis is formed and a new data element is retrieved
(of course, several data elements could be retrieved in response to a single
hypothesis). The several curves, originating at each data element initial
value, represent the decay in the utility of the data elements for goal
achievement. The really significant observation is the number of data elements,
in a given time interval, that are potentially of value for hypothesis testing
and formulation. Thus, in the interval between t^ and t^ both of the data 
elements retrieved at t^ and t^ are useful in decision making and hypothesis 
formulation. We postulate that the rate of decay of the utility of a given
data element is a function of the initial value (given by the curve E) and
the necessity of forming the next hypothesis or making the next decision
(characteristic of the problem solving situation).
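
The postulated Poisson form of data-element value can be explored numerically. The sketch below assumes that the time interval index v starts at 0 for the first interaction and that the value at interval v is simply the Poisson probability with parameter $\lambda$; the function names `poisson_value` and `value_curve` and the sample values of $\lambda$ are hypothetical.

```python
from math import exp, factorial


def poisson_value(v, lam):
    """Poisson probability lam**v * exp(-lam) / v! for interval index v >= 0."""
    return lam ** v * exp(-lam) / factorial(v)


def value_curve(lam, intervals):
    """Value assigned to a data element retrieved in each of the first intervals."""
    return [poisson_value(v, lam) for v in range(intervals)]


for lam in (0.25, 0.5, 1.0):
    curve = value_curve(lam, 5)
    print(lam, [round(p, 4) for p in curve])
```

One property worth keeping in mind when reading the downward-sloping curve: successive Poisson terms satisfy p(v+1)/p(v) = $\lambda$/(v+1), so the curve decreases from the first interval onward only when $\lambda$ does not exceed 1, and smaller values of $\lambda$ give a steeper initial drop.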

The most obvious conclusion from this brief analysis of the search
interface is the need for a "process of inquiry" characterization of the
process of retrieval. We have argued that the indexing system must present
data elements and relations to the searcher, but how are we to evaluate the
effectiveness of this presentation, especially when comparing alternative
systems? Possibly the greatest hindrance to such an understanding is the
lack of precision associated with the concept of "information need." What
does it mean to say that data elements satisfy an information need?
Accordingly, this will be the major topic of discussion for Chapter 5. As we
shall see, an understanding of "information need" will resolve the apparent
divergence between the concepts of retrieval effectiveness, the concept of
search paths, hypothesis testing and the searcher's understanding of the
organization of the document collection.

APPENDIX A

A Case History Illustrative of Index Failure*

During the closing stages of his doctoral research career, a friend
was engaged in the study of certain chemical reactions which he
characterized generally as follows:

$$\mathrm{RB(OH)_2 + 2\,CuX_2 + H_2O \rightarrow RX + Cu_2X_2 + HX + B(OH)_3}$$

Believing that his searches of the literature had uncovered all available
data on this type of reaction, he was somewhat chagrined to learn, near
the end of his studies of the reaction, from a colleague that another
document of (apparent) importance existed which he had failed to unearth.
More than a little unsettled by this occurrence, he made an exhaustive
effort to find the document in question through all available means.

There were a number of obvious places one might look in an index for this
document. Some of these are listed below without embellishment.

1. $\mathrm{RB(OH)_2}$
2. $\mathrm{CuX_2}$
3. $\mathrm{RX}$
4. $\mathrm{Cu_2X_2}$
5. Reaction of $\mathrm{RB(OH)_2}$ with $\mathrm{CuX_2}$
6. Production of $\mathrm{RX}$ from $\mathrm{RB(OH)_2}$
7. Reaction mechanisms, of $\mathrm{RB(OH)_2}$ with $\mathrm{CuX_2}$
8. Specific compound names (of which there were potentially a large
   number)

* Names have been omitted to avoid unnecessary adverse criticism of
specific persons or systems.

Since specific compounds reported in the document could not be anticipated,
my friend had recourse only to those index entries of a more general
nature, plus those specific compounds which his experience suggested he
check. This rather exhaustive search failed to yield the desired
document. So, my friend, having retrieved the document by means of the
information supplied by the colleague, endeavored to find the index
entries which corresponded with specific details in the document. The
result obtained was that only the specific $\mathrm{RB(OH)_2}$ compounds
were indexed (without qualification) and that no index entries had been
generated for any other part of the document.

The obvious conclusion which he drew was that a gross error had 
been committed in the indexing of this particular document. But this 
failure causes one to wonder whether other similar failures have gone 
undetected (where this one was detected only by chance). In any event, 
such occurrences certainly put the index user in an uneasy frame of 
mind, and, if opportunity exists, he will most likely turn to a different 
index rather than take a second chance with the one which has failed him.

CHAPTER V. ON RELEVANCE AS A MEASURE FOR IS&R



"You have seen the literary articles which have appeared at 
intervals in the Eatanswill Gazette in the course of the last 
three months, and which have excited such general--I may say
such universal--attention and admiration?"

"Why", replied Mr. Pickwick, slightly embarrassed by the 
question, "the fact is, I have been so much engaged in other 
ways, that I really have not had an opportunity of perusing
them. " 

"You should do so, sir," said Pott, with a severe countenance. 

"I will," said Mr. Pickwick. 

"They appeared in the form of a copious review of a work on 
Chinese metaphysics, sir," said Pott. 

"Oh," observed Mr. Pickwick; "from your pen, I hope?" 

"From the pen of my critic, sir," rejoined Pott with dignity. 

"An abstruse subject I should conceive," said Mr. Pickwick, 
"Very, sir," responded Pott, looking intensely sage. "He 
crammed for it, to use a technical but expressive term; he 
read up for the subject, at my desire, in the Encyclopaedia
Britannica."

"Indeed." said Mr. Pickwick; "I was not aware that that 
valuable work contained any information respecting Chinese 
metaphysics." 

"He read, sir," rejoined Pott, laying his hand on Mr. Pickwick's 
knee, and looking round with a smile of intellectual 
superiority, "he read for metaphysics under the letter M, and 
for China under the letter C, and combined his information sir." 

Charles Dickens 



1. Introduction

There is a timeless quality to the method used by Mr. Pott's critic. 
Indeed, Dickens has created a character who might be our contemporary. 
Information storage and retrieval systems have changed — they are larger 
and faster; but the same problems still exist — retrieval is still accomplished 
by the elementary combination of "information" on various subjects. Our 
methodology has changed but we are still as unsure of the result as was 
Mr. Pott. The problem of retrieval-system evaluation becomes of paramount 
importance. It is assumed that an understanding of the concept of relevance 
is essential to the solution of systems evaluation. Accordingly, this 
chapter is directed toward the definition of the problem of relevance; 
to definitions and history of the relevance and evaluation concepts as applied 
to system performance; to a schema for IS&R systems evaluation; and, 
finally, toward new directions in systems evaluation. 

2. The Problem of Relevance 

Wooster [1] has recently enumerated some of the many criteria available 
for the evaluation of the effectiveness of Information Analysis Centers. 
These criteria generally fall into five broad classes: need, use, cost, 
performance and benefit. Although his listing was specifically directed 
toward the Analysis Center concept, similar criteria are easily applied to the 
general information storage and retrieval evaluation problem. Wooster's 
exposition shows that current measures are diverse and exhibit little 
consistency of approach. Such conditions can only lead to confusion. Unless 
we thoroughly understand the problems of system evaluation, performance and 
benefit, efforts toward system description and comparison will merit Rees's 
[2] phrase: "...busy people spending large sums of money, designing — or 
attempting to design — phantom systems for non-existent people in hypothetical 
situations with unknown needs." 

But then, what is evaluation? To Richmond [3]: 

The very term evaluation suggests a qualitative procedure — 
making a value judgment. The quantification of evaluation 
is a matter of abstracting those factors that are not purely 
human, apparently such as performance and operation, and 
setting them aside to function as data upon which to make a 
value judgment. 

Evaluation, then, reduces to "making a value judgment." But why? It is easy 
to say that a judgment is made of performance and operational data, but 
toward what end? This question is partially answered if we assume that a 
value judgment serves as a two-place predicate between a datum and a 
well-defined end or goal. We say, for example, "x equals y," "x satisfies 
the goal y," or "x completes the process y." In information storage and 
retrieval such a value judgment does exist — but between which "x" and 
which "y"? There are five candidates available from the elements of the 
retrieval process (cf. Chapter IV): 

• The informational need (goal) of the user. 

• The expression of this need (query). 

• The corpus of documents. 

• The system for retrieval. 

• The set of retrieved documents. 

As we shall see later, many value judgments can be made on the performance* 
of system components (indexing, abstracting, thesaurus, document acquisition, 
query processing, etc.) that are apt to be hidden from a user; but the value 
judgment of primary importance is that which speaks to how effectively, from 
the user's viewpoint, the objectives of the search are being met. This value 
judgment is the correlation between the expression of the need (the 
formulation of the query which represents the informational goal) and the 
set of retrieved documents (the result of the system’s action). The most 
pertinent question, or judgment, from a user's point of view is whether the 
retrieved documents satisfy the goal requirement. 

A logical extension of Richmond's view of evaluation is offered by 
Stevens [4]: 

The most generally accepted criterion for appraising the 
effectiveness of indexing [or systems in general] is that 
of retrieval effectiveness. But, in general, this is merely 
the substitution of one intangible for another, entailing 
a string of yet unanswerable or at least unresolved questions. 

Retrieval of what, for whom, and when? How can effectiveness 
be measured except by the relative question of relevance judgments? 
How can human judgments of relevance and value [italics 
added] be measured and quantified? 

* Performance is usually measured in terms of efficiency (retrieval time 
and cost parameters) and effectiveness (goal satisfaction). 

Thus, a value judgment between the retrieved documents and the expression 
of a need is relevance. The criterion of relevance is an expression of the 
connectivity or linkage between documents and a request. As Hillman [5] 
sees it: 

The problem is to describe a concept of relevance independent 
of, and logically prior to, any notion of relevance as determined 
by, and thus restricted to, a particular system of storage 
and retrieval. 

This is, as we shall see, the most logical criterion for system evaluation, 
but the best measure, instrument and methodology have yet to be implemented. 
Accordingly, relevance measurement has a lengthy and conceptually fuzzy 
history in information storage and retrieval. 

3. Definitions and Measures of Relevance 
3.1 Definitions 

The domain of Information Storage and Retrieval suffers from an overabundance 
of definitions of relevance. Consistency is difficult to maintain 
between studies because each study that is undertaken involves a different 
definition of relevance. In addition, the introduction of new terminology 
forces the construction of new definitions of relevance, or the modification 
of previous ones. In this review, for convenience of presentation, we have 
chosen to condense and enumerate the various definitions of relevance under 
four headings: 

• Dictionary definition: 

  - Relevance is a relation to the matter at hand. 

• Communication definition: 

  - Relevance is a measurement of information transfer. 
  - Relevance is a phenomenon of communication indicating relations. 

• Value definition: 

  - Relevance is the amount of satisfaction in information transfer. 
  - Relevance is the "appropriateness" of the document to the user. 
  - Relevance is the "utility" of the document to the user. 
  - Relevance is the "satisfaction" derived by the user. 

• Connection definition: 

  - Relevance is that fraction of the retrieved material 
    that is actually relevant to the request. 
  - Relevant documents are those that describe situations 
    identical with that specified by the requester. 
  - Relevance is a user decision about document/query match. 
  - Relevance is the occurrence of each descriptor of the 
    search profile of the request in that of the document. 

Very few of these definitions are of immediate use in the quantification of 
the document-relevance/decision process. They are largely qualitative and 
should be interpreted as merely indicative of a philosophy of relevance. 
It is premature to suggest another definition of relevance, but it would 
be advantageous to list some important attributes and desirable features of 
an improved definition of relevance. Some of these attributes are included 
in the conclusions reached by the 1958 International Conference on Science 
Information [6]: 

• Relevance is more than the operation of relating what is 
  performed internally within systems; 

• Relevance is not exclusively a property of document content; 

• Relevance is not a dichotomous decision; 

• There is such a thing as "user relevance" that can be judged. 

This listing indicates directions for future relevance research. 

3.2 Measures of Relevance 



Bourne, in his review of the Evaluation of Indexing Systems [7], 
identified 31 terms used in relevance measures. Many of these terms differ 
only superficially, but their number certainly indicates the diffuseness of 
the state-of-the-art of measures of relevance. For this very reason, they 
are deemed worthy of enumeration: 



recall                 precision ratio          accuracy 
recall factor          normalized precision     efficiency 
recall ratio           sensitivity              snobbery ratio 
relative recall        productivity             fallout ratio 
normalized recall      relative productivity    discrimination 
relevance              specificity              distribution 
relative relevance     effectiveness            resolution 
generality             hit rate                 elimination 
generality ratio       acceptance ratio         pertinency factor 
precision              completeness             omission 



Bourne's conclusion denies the existence of a current science of relevance 
assessment: 

The experimental work reported in the literature seems to 
have used almost completely different measures for every 
single experiment reported. 

The results of these studies are not only frequently non-reproducible, they 
are non-comparable. 

In view of these introductory remarks, it seems pointless to dwell on these 
"measures". The interested reader is thus referred to several good reviews 
of the pertinent literature [8-11]. The conclusions to be drawn from this 
literature are not encouraging: there are too many different measures of 
relevance; there are too many mathematical forms employed; there are too 
many variations in method; and the results, if at all meaningful, are 
mostly system specific. 

Early studies characterized relevance as a highly subjective measure 
that indicated the degree of match between retrieved documents and a query. 
It was generally agreed that the concept of relevance was not identical to 
the contents of documents, but was, rather, a form of conceptual relatedness 
between query and retrieved document. Efforts at quantification produced 
the measures of recall and precision that serve as the formal basis for most 
of the 31 terms listed above: 

\[
\text{recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Number of relevant documents in the system}}
\]

\[
\text{precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Number of documents retrieved}}
\]
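
As a minimal illustration of the two ratios just defined, the following sketch computes them for a single search. The function name `recall_precision` and the document identifiers are assumptions made for this example only. The sketch presumes that the complete set of relevant documents is already known, which is exactly the presumption questioned in the paragraphs that follow.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    A ratio is reported as None when its denominator is zero,
    because the measure is then undefined.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents retrieved
    recall = hits / len(relevant) if relevant else None
    precision = hits / len(retrieved) if retrieved else None
    return recall, precision


# Hypothetical data: four relevant documents, three retrieved, two in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d7"}
recall, precision = recall_precision(retrieved, relevant)
print(recall, precision)
```

Both ratios share the same numerator; they differ only in whether it is compared with what the system holds or with what the system returned.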

Suddenly (as Taube [12] argues), the subjective notion of relevance is given 
a spuriously precise, mathematical definition. Taube cites the Cranfield 
studies [13] and the Arthur D. Little report [14] as primary causes for what 
he terms the pseudo-mathematics of relevance. A series of questions may be 
asked of these measures: How does one know a priori the number of relevant 
documents in a system? Does recall apply only to contrived systems where 
the team of evaluators has total knowledge of the system's contents? (Or, 
if not, and total knowledge is available, then what is the excuse for a 
system yielding recall values below 100%?) What is meant by relevant 
documents? Is this a circular definition of relevance? 

These measures are even less palatable when they are applied to the 
task of systems evaluation. Let us consider an example of the application 
of the measures of recall and precision to the evaluation of a hypothetical 
system. First, we assume that this system has a mechanism by means of 
which a search (query formulation) can be performed over its data base. 
Second, we assume that there exists a facility to vary search strategies, 
through, for example, boolean combinations of terms, in order to increase, 
decrease or mix values of recall and precision. Finally, we assume that 
for each search a value of recall and precision can be calculated.* The 
details of a general search (A) and of successively more precise 
sub-searches (B-E) are as follows: 



Search                         Search Terms** 
-----------------------------  -------------- 
General search strategy (A)    M only 
Sub-search (B)                 M & N 
Sub-search (C)                 M & N₁ 
Sub-search (D)                 M & N₁ & O 
Sub-search (E)                 M & N₁ & O₁ 



Exhaustive analysis of the system's data base yields a recall/precision point- 
pair for each of the above searches. Their hypothetical values are plotted 
on a standard recall/precision graph (see Figure 3.2.1). 
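
The example can be made concrete with a small sketch. Everything in it is hypothetical: the eight-document index, the relevance set, and the spelling of the narrower terms as `N1` and `O1` are assumptions chosen for illustration, not data from any study cited here. It evaluates the general search and four successively narrower sub-searches by boolean AND and reports a recall/precision point-pair for each.

```python
# Hypothetical index: document id -> set of index terms.
# Assumption: a document indexed under a narrower term (N1, O1)
# also carries the corresponding generic term (N, O).
INDEX = {
    "d1": {"M", "N", "N1", "O", "O1"},
    "d2": {"M", "N", "N1", "O"},
    "d3": {"M", "N", "N1"},
    "d4": {"M", "N"},
    "d5": {"M"},
    "d6": {"M", "N", "N1", "O", "O1"},
    "d7": {"M", "O"},
    "d8": {"N"},
}

# Assumed relevance judgments (unknowable in practice without exhaustive search).
RELEVANT = {"d1", "d2", "d3", "d8"}

SEARCHES = {
    "A": {"M"},
    "B": {"M", "N"},
    "C": {"M", "N1"},
    "D": {"M", "N1", "O"},
    "E": {"M", "N1", "O1"},
}


def run_search(terms):
    """Boolean AND: keep documents indexed under every search term."""
    return {doc for doc, doc_terms in INDEX.items() if terms <= doc_terms}


def point_pair(retrieved):
    """Recall/precision point-pair for one retrieved set."""
    hits = len(retrieved & RELEVANT)
    recall = hits / len(RELEVANT)
    precision = hits / len(retrieved) if retrieved else None
    return recall, precision


results = {name: point_pair(run_search(terms)) for name, terms in SEARCHES.items()}
for name, (recall, precision) in results.items():
    print(name, recall, precision)
```

With these assumed data, precision rises from search A through search C and then falls again for D and E, while recall never increases. The five point-pairs are thus perfectly well defined, yet nothing in their construction obliges them to lie on a smooth curve.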

One may not argue the existence of these points. But there is no reason 
to assume that these separate data points can be joined by a curve. If we 
remember that a curve is an expression of a functional relation between data 
points, then it is not evident that such a relationship exists between 
successive determinations of recall/precision point-pairs. Recall/precision 
graphs are often used (seemingly, unknowingly) to indicate an unvalidated 
dependence between successive searches. 

* This is an unwarranted assumption: in addition to the problem of deciding 
what is relevant, the total number of relevant references is unknown short 
of exhaustive system search. 

** N is generic to N₁; O is generic to O₁; & represents logical AND. 

Figure 3.2.1: The Recall-Precision Graph. 

Viewed differently, the assumption of a functional relationship implies 
the existence of values intermediary between the five data points plotted 
in Figure 3.2.1. This also implies that there exist additional terms in 
the system (apart from other combinations of M, N, N₁, O, O₁) that can be used 
to vary the specificity of the search. The assumption of the existence of 
additional terms is faulty if the only index terms in the system are M, N, 
N₁, O and O₁. 
Lancaster's [15] experience with MEDLARS has shown a wide point-scatter 
in recall/precision plots obtained from his test search results. 
Indeed, the recall and precision points he obtained for the various test 
searches were near random in nature. Clearly this suggests the absence of 
any correlation or functional relationship between searches. It appears that 
the measures of recall and precision are simply indicative of a "fuzzy-zone" 
of system performance. 

As if to destroy the utility of system performance curves, O'Hara [16] 
has demonstrated that of the eight possible boundary positions on a recall/ 
precision plot (see Figure 3.2.2), two require contrived definitions to be 
meaningful, and two are clearly impossible (0% recall and less than 100% 
precision, 0% precision and less than 100% recall). 

Extensions of the concepts of recall and precision have involved both 
micro and macro definitions [17], probability measures [18], decision table 
analysis [19,20], expected search length [21], and relevance feedback [22], 
to mention a few. In my opinion these studies rest upon faulty postulates — 
i.e., recall and relevance. Informally and from a subjective viewpoint, 
these measures may have some value, but they cannot serve as a basis for 
formal analysis. Clearly, relevance has yet to be assigned a definition 
suited to quantification and mathematical manipulation. Without such a 
definition, empirically testable generalizations about systems performance 
are impossible. 

Figure 3.2.2: Limiting Cases Associated with a Recall-Precision Graph. 

4. A Schematic for IS&R Systems Evaluation 

If it is difficult to find a consensus on a definition of relevance, 
it is as difficult to evaluate comprehensively studies of systems evaluation. 
The factors and interfaces examined in these studies are numerous. 
Cuadra [23] has identified some of the features of relevance judgments: 

Evidence has been developed that suggests that relevance 
judgments can be and are influenced by the skills and 
attitudes of the particular judges used, the documents 
and document set used, the particular information requirement 
statements, the instructions and settings in which 
the judgments take place, the concepts and definitions 
of relevance employed in the judgments, and the type of 
rating scale or other medium used to express the judgments. 

Cuadra's [23] final report, Experimental Studies of Relevance Judgments, 
identifies four broad classes of factors that influence the relevance 
judgment decision: 

• Document 

  - Subject matter 
  - Diversity of content 
  - Difficulty level 
  - Scientific hardness 
  - Amount of information 
  - Level of condensation 
  - Textual attributes 

• Judgmental Conditions 

  - Time of judging 
  - Order of presentation 
  - Size and breadth of document 
  - Use of control judgments 
  - Specification of task 
  - Definition of relevance 

• Information requirement statement 

  - Subject matter 
  - Difficulty level 
  - Diversity of content 
  - Specificity of information 
  - Functional ambiguity 
  - Textual attributes 

• The Judge 

  - Knowledge/experience 
  - Intelligence 
  - Cognitive style 
  - Biases 
  - Judging experience 
  - Attitude 
  - Distribution expectancy 

To me, the problem lies in the number of different interfaces (or points of 
correspondence) in the information storage and retrieval process where a 
value judgment can be effected and measured. The following model is presented 
as an aid in the clarification and classification of the various relevance 
judgmental decisions. 

4.1 The Model 

Figure 4.1.1 shows the position of the index in the information storage 
and retrieval process; note the importance of the feedback process in the 
search operation. The model is divided into three units based on the fundamental 
operations of document creation, representation and retrieval. These 
operations are also depicted in Figure 4.1.2, which shows that the indexing 
operation encompasses document acquisition, representation and storage, while 
retrieval is represented by the exchange between a user's informational need 
and the expression of this need. The double arrow between these two components 
is indicative of the feedback that is essential to these activities. Noise 
has not been indicated but it could perturb any of the components or operations 
involved in the communication process. The seven dotted lines indicate the 
various types of judgmental decisions that are applicable to systems evaluation. 
Each is described briefly as follows: 

Figure 4.1.2: Evaluation Judgments. 

Judgment A: 

Judgment A is probably the least studied interface because of the 
difficulty of its measurement. However, to a first approximation, a system 
is only as good as the documents that it collects. This is why efforts 
should be directed toward evaluation of system input procedures. From a 
slightly different point of view, Paisley and Parker [24] have recognized 
that source deficiencies may severely limit system performance. To this 
author's knowledge, no comprehensive studies have been undertaken dealing 
with judgment A. 

Judgment B: 

The "aboutness" judgment (as Fairthorne [10] puts it) receives the 
greatest amount of attention by researchers — possibly because it is the 
easiest to measure. As discussed in Chapter IV, accurate document representation 
is the most important system input function. Accordingly, the 
exhaustiveness of the indexing and the specificity of the indexing language 
are the most often studied. While Zunde [25] concludes that "documental" 
factors are the most important parameters in indexer consistency, St. Laurent 
[26], in a comprehensive survey, opines that no conclusion can be reached 
from the diverse studies of indexer consistency. One of the better summaries 
of this interface is proposed by Lay [27]. He defines three types of index 
terms (T) and relations (R) between terms based on their presence in the 
document, the index, or both (see Figure 4.1.3). It is believed that an 
effective study of interface B might employ such a decomposition as its 
underlying model. 

        In Document    In Index 
T1           +             + 
T2           -             + 
T3           +             - 
R1           +             + 
R2           -             + 
R3           +             - 

+ = present 
- = absent 
T = terms 
R = relations 

Figure 4.1.3: Term/Relation Analysis. 
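
The decomposition of terms into three types can be sketched with ordinary set operations. This is an illustrative reading of the figure, not Lay's own procedure: the function name `classify_terms` and the sample terms are assumptions, and the same treatment would apply to relations (R) if each relation were encoded as a pair of terms.

```python
def classify_terms(document_terms, index_terms):
    """Split terms into three types by where they are present."""
    document_terms, index_terms = set(document_terms), set(index_terms)
    return {
        "T1": document_terms & index_terms,  # present in document and index
        "T2": index_terms - document_terms,  # absent from document, present in index
        "T3": document_terms - index_terms,  # present in document, absent from index
    }


# Hypothetical document and index entry.
document_terms = {"indexing", "entropy", "retrieval"}
index_terms = {"indexing", "retrieval", "thesaurus"}
types = classify_terms(document_terms, index_terms)
print(types)
```

Under this reading, a large T3 set would suggest indexing that is not exhaustive, while a large T2 set would suggest terms assigned by the indexer that the document itself does not use.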





Judgments C and D: 



In my opinion the most significant correlations for evaluation are 
those between the need and the expression of the need, and between need and 
the retrieval output. Unfortunately these distinctions are largely overlooked 
in the literature of relevance. Unless the user’s need is satisfied the 
retrieval is not effective. 

Judgments E and F: 

Taulbee [9] characterizes these judgments by the decision on the 
relationship between the "information" need and a given document. (There 
is some debate as to whether this judgment should be on a corpus of retrieved 
documents rather than on a 1-1 basis; see Goffman [28].) These relevance 
judgments are often formalized by means of document and query vectors that 
permit facile comparisons. The reader is directed to reports on the SMART 
system [17] and to Ide's [22] "relevance feedback" analysis for details. 
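
A vector comparison of this kind can be sketched as follows. The cosine correlation used here is one common matching function for term-weight vectors; the weights, the term names and the function name `cosine` are assumptions made for illustration and are not taken from the reports cited above.

```python
import math


def cosine(u, v):
    """Cosine correlation between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)


# Hypothetical query and document vectors.
query = {"relevance": 1.0, "evaluation": 1.0}
documents = {
    "doc_a": {"relevance": 2.0, "evaluation": 1.0, "indexing": 1.0},
    "doc_b": {"thesaurus": 3.0, "indexing": 1.0},
}

# Rank documents by decreasing correlation with the query.
ranked = sorted(documents, key=lambda d: cosine(query, documents[d]), reverse=True)
print(ranked)
```

Such a score measures only the correspondence between the expression of the need and the document representation; it says nothing about the need itself.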

Although it is claimed that these judgments can be made by the user or by 
an independent judge (observer or mathematical criterion), O'Connor [29] 
takes a different view: 

The basic causes of relevance disagreements are differences 
in interpretation of requests or documents, rather than such 
factors as the education of the judges and what they take to 
be the purpose, environment and timing of the request. 

Judgment G: 

No studies have dealt with the difficult questions of the correlation 
between the user's need and the system's representation (we will not consider 
selective dissemination as representative of the essence of judgment G). 
Leslie [30] humorously depicts this difficult judgment: 

Somebody has defined indexing as a game involving two players — 
an indexer and a user. In this game, the first player (the 
indexer) tries to guess where the user will look for a particular 
record. The second player (the user) tries to guess where 
the indexer put it. The game gets a little complicated when 
the user tries to guess where the indexer guessed the user 
would guess the indexer guessed the user would look for it. 

Many systems can, unfortunately, be described in these terms. 

5. Directions 

It has been argued that the systems evaluation problem reduces to the 
task of document relevance assessment. This conceptual reduction does not, 
however, yield a corresponding reduction in the difficulty of solution. 
Furthermore, the situation is worsened by the many "pseudo-" measures of 
relevance, all of which seem to rely on the less-than-satisfactory measures 
of recall and precision. Clearly, the task of relevance assessment is ripe 
for new directions. 

The model presented above serves, mainly, for the enumeration of the 
many judgmental interfaces that exist in the generalized information storage 
and retrieval process. Unfortunately, this enumeration has but increased 
our awareness of the difficulty of deciding just what is relevant, and on 
what basis. 

The main inference to be drawn, apart from that of the chaotic state of 
relevance studies, is that systems evaluation must be centered on the needs 
of its users. That a user may not want all of the relevant references that 
a system can provide — perhaps the first one retrieved will be sufficient to 
satisfy his informational need — should also be taken into consideration. 
This suggests that evaluation research should be directed not only toward the 
correspondence between query and document, but toward the identification of 
the attributes of the user's goals. Judgment G of the model described in 
Section 4.1 must be an integral part of such research. 

Cuadra [31] summarizes the problem: 

Nearly all studies purporting to evaluate the effectiveness 
of information retrieval systems have relied very heavily 
on the notion of a "relevant set of documents" identified 
by a particular set of judges. The relevance judging process 
in these studies has been treated largely as a "black box," 
with little serious effort to understand what happens 
inside the box or how variations in the judgments might lead 
to variations in the "relevant set of documents." 

There is, then, a clear difference between relevance to a query and relevance 
to a need. This calls for the analysis of the searcher-receiver box of the 
model which represents the correlation between the expression of the need 
and the need. Figure 5.1 shows an adaptation of a model proposed by Kegan 
[32], and indicates some of the pertinent factors of this correlation. It 
is assumed that a more nearly complete representation and understanding of 
the above factors will aid in deciding what is information both to the user 
and to the corresponding relevance judgment. 

6. Interregnum 

The notable conclusion to be drawn from these introductory sections is 
that the field of Information Storage and Retrieval (IS&R) lacks a comprehensive 
theory of retrieval systems evaluation — an unfortunate circumstance since 
a solid theoretical foundation is essential for the characterization and 
evaluation of retrieval experience. Stated differently, an observer's 
experience in the real world has meaning only if he has a sound predictive 
model. The diffuseness of any existing predictive model (assuming that a 
model does exist) is indicated by several observations: 

• The problem of the definition of "systems evaluation" — e.g., 
  which of several possible definitions is the most useful? 

• The IS&R researcher is confronted by a wide range of definitions 
  of relevance — which definition is of greatest value? 

• There is an equally extensive collection of "measures" of 
  relevance — which of these measures is meaningful, and in 
  what sense? 

• One is plagued by indecision and confusion when presented 
  with the various possible forms of relevance judgment — 
  what is the meaning of all of these "judgments"? 

Figure 5.1: IS&R Decisions. 

The most important observation that can be made, at present, is that 
evaluation is a judgmental relation, which is best characterized as a 
two-place predicate or relation between a datum (or data) and a goal. One should 
not be constrained to think of this relation as a simple operator in a formal 
calculus; rather, it is hypothesized that the evaluation relation is a 
placeholder for a process or an algorithm connecting data and goals. Evaluation 
is viewed as a process which is best characterized by judgments C and D, as 
depicted in Figure 4.1.2. More explicitly: evaluation is a relational 
algorithm for measuring the strength of connection between the informational 
need and the retrieved documents plus the expression of the need (see 
Figure 6.1). Retrieved documents are the retrieval results from an IS&R 
system and the expression of the need is the query presented to the system 
by the user. The main problem with this definition of evaluation is that 
the term "information need" is a fuzzy concept. The purpose of the following 
sections is to concretize the concept of information need. 

The Kegan model (see Figure 5.1) is a representation of the processes 
employed in the interaction (represented by the double arrow (<->) in Figure 
4.1.2) between the need and the expression of the need. Three important 
attributes of interaction are identified which are worthy of enumeration: 

• The user's identification of the factors required for 
  successful data use. 

• User decisions. 

• The perception by the user of what it is that he 
  needs and his perception of what the system has 
  to offer.* 

        Information Need 
               R1 
     Expression of the Need 
               R2 
       Retrieved Documents 

Figure 6.1: Interaction Between the Need and the 
Expression of the Need. 

These attributes are not only useful in the description of this specific 
interface, but, as we shall see, they are also instrumental in the analysis 
of information need. System attribute identification, user decision making 
and user perception(s) are all terms that will assume increasing importance 
as we strive to better understand the need/expression-of-the-need interface. 
But it is already clear that attribute identification, decision making and 
perception represent complex and highly dynamic activities. 

Furthermore, inquiry may be represented as the progression, or iteration, 
of queries presented to the system. Thus, there is no reason to assume, 
a priori, that the nature (form, status, mode) of the above-listed activities 
is invariant through several inquiry iterations with an IS&R system. And, 
if we can accept the idea that information need is a dynamic concept, then 
we must ask how these activities arise, how they interact among themselves 
and how they contribute to the identification and use of "relevant" data. 

In what follows, an attempt will be made to characterize information 
need and system interaction through a consideration and development of the 
following topics: observation and measurement, experience, the central role 
of the process of inquiry, the goal as information need, the reason for 
the existence of the goal, hypothesis testing, decision making, estimation 
of probability and information gain. 



* It is important to realize that these perceptions are not necessarily 
the same. These perceptual differences are the source of considerable 
error in retrieval system interaction. 

7. Information Need 

7.1 The Problem Posed to IS&R 

The standard IS&R-oriented definitions of "information need" are nicely 
enumerated by O'Connor [33]. He identifies three broad meanings of the 
statement "satisfying a requester's information need": 

• Request negotiation good [e.g., interactive procedures 
  with the system have been successful — documents are 
  provided]. 

• System provides the user with information helpful to his work. 

• System provides the user with "documents that he is glad 
to get." 

Although these statements are simplistic, they accurately characterize current 
thinking in IS&R system evaluation. Most measures and relevance judgments 
either have their origin in, or ultimately reduce to, a measure having one 
of these three "meanings". 

It should be clear that these three statements are procedurally, or 
operationally, oriented; the problem of information need is not directly 
addressed. I choose, rather, to replace these general statements with a 
series of questions directed toward basic issues, the answers to which will 
yield a definition of information need: 

• Why is the user seeking information? 

• What creates a need for information? 

• How is the process of inquiry related to 
information need? 

• What is meant by Information? 

7.2 Observation and Measurement 

Evaluation and relevance are assumed to be basic concepts, essential in 
studying the interaction of man with his environment. 

Man lives in a world of interaction and communication. Thus, we may 
accept as a basic premise that men and/or systems that exist in total 
isolation from their environments are without interest. Indeed, it may be 
argued that systems* of any kind exist within environments and must interact 
with them. An observer of such interactions may record data that are 
transferred to or from the system and, in that sense, may be said to observe 
actions taken or caused by the system. But in what sense does the observer 
actually observe these interactions? 

We note that a system tends to maintain equilibrium** with its environment.
Now, if a system remains in equilibrium with its environment, then a
new and uninitiated observer, who is told to "observe" the system for the 
first time, will have great difficulty in separating it from its environment. 
The point of this argument is that disequilibrium is essential to the 
observation of interaction. That is, if, for some reason, a system is in
disequilibrium with its environment, then it must effect potentially observ- 
able change to re-establish equilibrium. In effect, the change from dis- 
equilibrium to equilibrium corresponds to a system's entropy reduction 
operation, that is, the entropy of an effective system must be lower than 
the entropy of its environment. Since systems in which we are interested are 
presumed to be finite and to contain a finite number of states, the process 
of equilibration must involve changes in a finite number of states. These 
changes in state are what the observer is privileged to observe. Consensus 
about the reality of any such observations must depend, in the final analysis,




* A system is defined as that portion of the universe chosen for observation. 

** Equilibrium is used analogously to its use in thermodynamics.

upon the acceptability to others of methods by which matters of truth,
principle, or any other justifiable grounds for shared belief are made evident. 
Under a fundamental rule by which Western scientists make the empirical world 
evident to each other, nothing exists that does not exist in some amount 
and is not, therefore, measurable. Granted this additional premise of 
shared belief about the world of science, then the ability of the observer 
to perceive a change of state depends on the precision of the measuring
device(s) available to him. 

While measurement in these terms involves observation, it must also 
involve an operation, i.e., the assignment of a value to what was perceived.

So grounded, each uniquely perceptible element of an observation may be 
assigned a unique number. Measurement, viewed in this way, becomes analogous 
to the action of a random variable. A random variable effects a one-to-one
mapping between the event space (all possible states of the system) and the
real line. Thus, a random variable is a measuring device (cf. Def. 2.1,
Chapter IV).

In principle, the observer can observe any changes of state of the system 
under scrutiny (moments of equilibrium being infrequent). By this argument, 
he should be able eventually to develop a probability function that assigns 
to each measured observation a probability of occurrence. The probability 
function that is developed is the observer's subjective estimation of 
probability, since the observer is usually unable to observe the system for 
the extremely long periods of time required by objective probability. 
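
The view of measurement as a random variable, and of subjective probability as an estimate built from a limited run of observations, can be sketched in a few lines of Python. The state names, their numeric codes, and the helper `estimate_probabilities` are illustrative assumptions of mine and not part of the argument above; the estimate shown is a simple relative frequency.

```python
from collections import Counter

# Hypothetical event space: each perceptible state is assigned a unique
# number, which is what lets measurement act like a random variable.
MEASURE = {"idle": 0, "searching": 1, "retrieving": 2}


def estimate_probabilities(observed_states):
    """Return relative frequencies of the measured values.

    This stands in for the observer's subjective estimate: it rests on a
    finite record of observations, not on a long-run limit.
    """
    values = [MEASURE[state] for state in observed_states]
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


observations = ["idle", "searching", "searching", "retrieving"]
estimate = estimate_probabilities(observations)
```

With only four observations the estimate is coarse, which is the sense in which it remains subjective: a longer record would be needed before it could be treated as an objective probability.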

7.3 Interpretation and Extension of Experience

To facilitate the previous discussion, it was assumed that the system 
and the observer were separate entities. Let us now consider the system and
the observer to be the same, so that attention can be directed to the analysis
of the system's experience with its environment. Of primary concern is how 
a human system (man) goes about creating order out of the apparent chaos that 
confronts it. Perception, observation, measurement, and the estimation of 
probability of occurrence are, as I see it, the initial factors required in
this ordering process. Caws [34] has outlined the subsequent steps that
man must take: 

• Step from a specific experience to correlation with 
prior experience. 

• Step from prior experience to knowledge of one's own 
particular world. 

• Step from knowledge of one's own world to knowledge of
a world shared with other men. 

Knowledge is defined by Caws as "the ability to make true statements and 
defend them as true" [35]. Thus, the statement: "system X contains documents 
giving the boiling point of water at 10,000 feet of altitude" is not knowledge 
since it has not been defended as true — there is no indication that 
the system has been searched to find at least one of these documents. A 
statement indicating knowledge of the system would be similar to: "system
X contains documents giving the boiling point of water at 10,000 feet of 
altitude, and document 973 contains those data." 

The third step is probably the most important because it gives rise to 
common experience. The validation of common experience is the first step in 
scientific activity, and the specialized inductive inference exhibited in 
the first two steps is manifested in the inductive inference of scientific 
method. Science assumes that a logical progression of inquiry will yield 
answers that asymptotically converge to an explanation of an "unknown". 
Scientific activity and scientific method, represented by a presumed
correspondence between empirical propositions and general theoretical
propositions, are a logical extension or, perhaps, mirror of basic human activity.

Theory, as a by-product of scientific method, may be defined as a 
logical, probabilistic structure that is created from empirical propositions. 
But empirical propositions are the result of measurement which is, in turn,
dependent upon a defensible identification of observables. Questions about 
theory (structure), thus defined, ultimately reduce to questions about 
observables, which are the elements of experience. 

7.4 Science as the Generation of Hypotheses 

Inquiry, as I have described it, results from observation and measure- 
ment. An alternate way of viewing the progression of inquiry is as a process 
of hypothesis generation and testing. A hypothesis about observables serves 
as a formal representation of the observer’s subjective estimation of the 
probability of an event (or group of events). We shall define a hypothesis, 
then, as any verifiable proposition that is not itself an observational 
statement. Consequently, the decision problem associated with adducing support 
for or rejecting a hypothesis amounts to finding a method or algorithm for 
discovering whether a well-formed statement (a meta-observational formula)
is refutable. 

Prior to the formulation of a hypothesis, an observational sentence and 
an empirical generalization must have been stated (or else must be 
susceptible of construction on demand). An observational sentence is defined
as a sentence of observational terms joined by grammatical and logical
connectives; an empirical generalization is of the form: all X's are Y's.

A hypothesis may, therefore, be defined as follows: "a sentence which has as
a consequence at least one empirical generalization, but whose contradictory
does not have the form of a protocol [an observational] sentence" [36].
The following are examples which illustrate the meaning of these three terms:

Observational Sentence: This retrieved document from system X
satisfies my information need.

Empirical Generalization: All retrieved documents from system X
satisfy my information need.

Hypothesis: A retrieved document from system X will
satisfy my information need.

The cyclical process characteristic of the scientific method can now be 
modelled as indicated in Figure 7.4.1. The closing of the cycle is provided
by the observables that result from the testing of hypotheses.

The verifiability theory of meaning [37], popularized by the Vienna 
Circle of the 1920's and 1930's, contended that a sentence was empirically 
meaningful only if it was verifiable. Thus, a sentence (e.g., a hypothesis)
is relevant to some thing only if it asserts or denies something with 
respect to it. It is easy to infer that the relevance of a sentence to and 
knowledge about a thing are closely tied, if not identical. A sentence about 
a thing is irrelevant if knowledge of it cannot be obtained. 

The observational cycle is now complete. Observations of one's 
environment constitute measurement; measurement permits the formulation of
empirical propositions (observational sentences) and these propositions yield
hypotheses which, when tested, yield new observables or knowledge (or both). 

Figure 7.4.1: The Scientific Method.

7.5 Information Acquisition Through Hypothesis Testing

The observation/measurement/hypothesis-testing cycle also has an
informational counterpart or explanation. Let us assume that the system under
observation can be in any of n possible states $E_1, \ldots, E_n$. The
observer observes the system because he is uncertain about the disposition
of its component states (e.g., which states are in existence). Prior to
his interaction with a system,* the observer must allow for the existence
of any one of n different states (n could be infinite). The observation
of a change in the state of the system effects a partitioning of the initial
states into those states which are observed to exist, and those states that
have not yet been observed. Expressed differently, the reception of data
by a system yields a reduction in the size of the set of a priori alternative
states of the environment:

$$|\beta| > |\beta'|$$

This reduction in the number of alternatives, or increase in certainty about 
the entity under observation, is information in the classical Maxwell/ 
Boltzmann sense. 

Thus, information is a reduction** of ignorance through an n-fold
polychotomy of $\beta$. Or,

$$\text{Ignorance}(\beta') < \text{Ignorance}(\beta)$$
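
One way to put a number on this reduction, assuming the alternatives are equally likely (an assumption the text does not state), is the logarithm of the ratio of the two set sizes. The function name `information_gained` and the example states are hypothetical and serve only as a sketch.

```python
import math


def information_gained(prior_states, remaining_states):
    """Bits gained when the set of alternatives shrinks.

    Assumes equally likely alternatives, so ignorance is taken to be
    log2 of the number of states still possible.
    """
    prior, remaining = set(prior_states), set(remaining_states)
    if not remaining or not remaining <= prior:
        raise ValueError("remaining states must be a non-empty subset")
    return math.log2(len(prior) / len(remaining))


beta = {"s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"}
beta_prime = {"s2", "s5"}  # alternatives left after one observation
bits = information_gained(beta, beta_prime)
```

An observation that leaves the set of alternatives unchanged yields zero bits under this measure, which matches the claim that information requires a reduction in the number of alternatives.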

Information acquisition also amounts to the observer's certainty about the 
existence of any given state. This certainty is manifested in the subjective 
estimation of the probability of a state $E_i$, $P^{(v)}(E_i)$. Given a datum $d_v$
(i.e., one which occurs during the $v$th observation of a state) the credibility



* We have already pointed out that interaction is effected through a
change of state which yields observables.

** This reduction in ignorance can only be effected through an expenditure
of energy or negentropy (N). We assume, following Brillouin [38], that
$\Delta I - \Delta N < 0$.
of the hypothesis that event $E_i$ will occur is measured by $P^{(v)}(E_i)$. This
credibility is updated through Bayes' theorem:

$$P^{(v)}(E_i) = \frac{P^{(v-1)}(E_i)\, P(d_v \mid E_i)}{\sum_{k} P^{(v-1)}(E_k)\, P(d_v \mid E_k)}$$

Watanabe [39] defines an inductive entropy which is a measure of the
uncertainty of the validity of the probability hypothesis:

$$U^{(v)} = -\sum_{i=1}^{n} P^{(v)}(E_i) \log P^{(v)}(E_i)$$

His inverse H theorem states that $U^{(v+1)} \le U^{(v)}$, or that the average
uncertainty of the probability of a given state monotonically decreases with each
new observation. This reduction of uncertainty may be defined as generalized
learning.
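
The updating of credibility and the accompanying fall in uncertainty can be illustrated with a small numerical sketch. The three states, the likelihood values, and the function names below are assumptions chosen for illustration; the entropy is computed in bits over the current credibilities. For an arbitrary sequence of data the entropy need not fall at every single step, since the statement above concerns the average; the data here were chosen so that they consistently favour one state.

```python
import math


def bayes_update(prior, likelihoods):
    """One step of Bayes' theorem over a finite set of states.

    prior[i] is the credibility of state i before the datum;
    likelihoods[i] is the probability of the datum given state i.
    """
    joint = [p * lik for p, lik in zip(prior, likelihoods)]
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("datum is impossible under every state")
    return [j / evidence for j in joint]


def entropy(probabilities):
    """Entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Three candidate states, initially equally credible.
credibility = [1 / 3, 1 / 3, 1 / 3]
# Hypothetical likelihoods of four successive data under each state.
data_likelihoods = [[0.8, 0.3, 0.1]] * 4

history = [entropy(credibility)]
for likelihoods in data_likelihoods:
    credibility = bayes_update(credibility, likelihoods)
    history.append(entropy(credibility))
```

In this run the recorded entropies in `history` fall with each observation while the credibility of the first state approaches one, which is the pattern the text calls generalized learning.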

The continuous decrease of U and the convergence* of $P^{(v)}(E_i)$ provide
support for the observer's hypothesis that $P(E_i)$ is the true value. If
$P(E_i)$ does not converge, then the observer must adopt a new estimation,
$P'(E_i)$, of the probability of occurrence of state $E_i$. In effect, he must
formulate a new hypothesis in the face of conflicting data.

The observer's information processing and hypothesis testing activities 
may be explained as an attempt to reduce the uncertainty about, and the
number of attributes (states) of, the system under observation which must 
be processed. Knowingly or unknowingly, that is, the observer attempts to 
eliminate redundancy through "information" reduction**. As will be shown,
efficient information processing is assumed to be a prerequisite to the 

* The convergence of $P^{(v)}$ means that $|P^{(v)}(E_i) - P^{(v-1)}(E_i)| < \epsilon$, where $\epsilon$ is small.

** This is analogous to Posner's [40] information reduction, conservation
and transformation.

observer's cognitive and recognitive tasks.

7.6 Hypotheses and Hypothesis Testing 

To this point we have been concerned with the role of hypothesis test- 
ing in observation, information processing, inquiry, induction and in the 
estimation of probability. But we have not investigated how hypothesis 
formulation and testing forms an integral part of man’s problem solving 
behavior. Let us state the fundamental postulate of this section, and, 
indeed, of generalized information need, namely that all action and thought
are based on the testing of hypotheses.* This is a bold assumption, but
it will be shown to be instrumental in the understanding of "information
need."



Minsky gives a first clue to the nature of an observer’s model of the 
environment: 

The problem solving abilities of a highly intelligent person 
lies [sic] partially in his superior heuristics for managing 
his knowledge structure and partially in the structure itself. 

These are probably somewhat inseparable. In any case, there 
is no reason to suppose that you can be intelligent except 
through the use of an adequate, particular knowledge or model
structure. [41] 

A man’s model of the world is a distinctly bipartite structure: 
one part is concerned with matters of mechanical, geometrical, 
physical character, while the other is associated with things 
like goals, meaning, social matters. [42] 

He assumes that a model of the world is based on an individual knowledge 

structure, or cognitive structure. Minsky finds it convenient to partition 

this structure into two parts, each part (or, perhaps, each process) dealing 



* One may make a case for a distinction between purposeful and
non-purposeful (involuntary) action. While such a distinction may ultimately
be helpful, it will not be pursued here.

with a different aspect of the environment. The physical processing part
of the structure deals with the raw data inputs from the environment, while
the goal-directed part is man's reaction to, and interaction with, the
environment. The former process we choose to call the concurrent hypothesis 
structure and the latter the actively hypothesising structure. Thus, there 
are postulated to exist two distinct levels of hypothesis testing within man’s 
model of the world. 

7.6.1 Concurrent Hypotheses About the Perceived World 

Information that the observer obtains from the environment (e.g., the
polychotomy of $\beta$, see Section 7.5) permits him to estimate the probability
of occurrence of any observed state. This perceptual information we assume
to be manifested in the creation of "constancy" hypotheses about the states
of the environment. These hypotheses are either the observer's estimation
of probability or, if $P(E_i) = 1$, his observation of a continuously occupied
state. Since no action is required of the observer in these cases, that
is, since he is in equilibrium with the environment, with respect to the
states in question, these "constancy" hypotheses are relegated to non- 
attentive processing. In other words, as long as sensed data appear to 
support the hypotheses, no action, or conscious attention, is required of the 
observer . 

The reception (acquisition) of negative data elements , or data elements 
that do not support one of the set of concurrent hypotheses, throws this set
of hypotheses out of equilibrium. If this is the case, the observer must 
either change the probability associated with the hypothesis, formulate an 
alternative hypothesis, or both. In any event, information need is an 
expression of the observer’s need to acquire more data, to create new 
hypotheses and to observe the environment with renewed attention. This view
is analogous to Harmon's [43] interpretation of information need:

"Information needs might be viewed as products of change within a system of
personal constructs."

7.6.2 Active Hypothesis Testing 

Human intellectual activity, according to my argument, reduces to the 
active creation and subsequent testing of hypotheses. I assume hypothesis
testing to be a central mechanism for updating cognitive structure. The 
terms "cognition" and "structure" are employed to suggest the mental processes 
of data acquisition and ordering. 

Cognitive structure is viewed as an ordered collection of data elements
and of relations between data elements. For purposes of analogy, cognitive 
structure is taken to be similar to Quillian's [44] data structure where 
nodes are words and linkage indicates a relationship between words. (A 
similar idea was earlier proposed by Bernier [45]). Although information is 
assumed to be obtained from the partitioning of the event space, information 

is also postulated to exist as a context-sensitive structure (see Ernst and
Yovits [46]) that represents both an imposition of order upon things "known"
to exist and an order of observation of data from the environment. One 
part of cognitive structure is thus assumed to be a representation of what 
the observer has observed in the environment. We may conclude that cognitive 
structure is an observational theoretical index of the perceived environment.

In addition to the indexing function, another function of the cognitive
structure is presumed to provide a site for active hypothesis testing. This 
hypothesis testing takes the form of assumptions about the existence of as 
yet unobserved data elements and of relations between them. Two cases for
hypothesis generation arise:

case 1: A relational structure exists but the value of a data
element locus is unknown. A hypothesis is formed about
the existence of such a data element, and the environment
is observed; this is inferred to be a manifestation
of information need.

case 2: A hypothesis is formed about the existence of a relation
between data elements, or about the existence of a specific,
unobserved data element in relation to a known data element;
this is also inferred to be a manifestation of information
need.

In both cases, information need is interpreted to be the expression of 
a need to provide support for a hypothesis. Negative support will create a
disequilibrium in the existing collection of hypotheses and will require
continued data acquisition and new hypothesis formulation on the part of
the observer. Positive support may lead to an end-state, or goal, of the
ongoing process of inquiry. 
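
The two cases can be made concrete with a toy cognitive structure. The sketch below borrows the boiling-point example of Section 7.3; the dictionary layout, the node names, and the two helper functions are my own illustrative assumptions, not a model proposed in the text.

```python
# Hypothetical cognitive structure: nodes are data elements (None marks a
# locus whose value is unknown); relations link pairs of elements.
nodes = {
    "water": "H2O",
    "altitude": "10,000 ft",
    "boiling_point": None,
}
relations = {("water", "boiling_point"), ("altitude", "boiling_point")}


def case_one_needs(nodes, relations):
    """Case 1: a relation exists but an element it touches has no value."""
    related = {name for pair in relations for name in pair}
    return sorted(name for name in related if nodes.get(name) is None)


def case_two_needs(relations, conjectured_pairs):
    """Case 2: conjectured relations not yet present in the structure."""
    known = {frozenset(pair) for pair in relations}
    return [pair for pair in conjectured_pairs if frozenset(pair) not in known]


missing_values = case_one_needs(nodes, relations)
untested_relations = case_two_needs(
    relations, [("water", "altitude"), ("altitude", "boiling_point")]
)
```

Each name returned by `case_one_needs` and each pair returned by `case_two_needs` corresponds to a hypothesis still awaiting support, that is, to an information need in the sense used here.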

In both the above cases, we assume an observer and decision maker who 
acts "rationally" according to our model. However, experience indicates that 
quite often "irrationality" (or alternative "rationality") prevails, e.g.,
the acceptance of a hypothesis is based on a definition of the situation
that our model does not describe. A "favored" hypothesis persists, despite
what we should expect, as a complex union (or intersection) of simple 
hypotheses. One explanation of this phenomenon is that credibility of the 
favored hypothesis is achieved through a form of transitive logic over the 
sub-hypotheses. For example, hypotheses A,B, and C have been observed to 
be supported; hence, D, the favored hypothesis, a compound of hypotheses A,
B, and C, is also assumed to be true. The "irrational" processor, who
fails to act as a perfect information processor in our view, will continue
to accept hypothesis D* in the light of what we "know" to be conflicting
data, usually until some of the component hypotheses are demonstrated to be
false. "Error" of this type is viewed as a temporary deviation from
intellectual hypothesis testing as our model prescribes for it.

7.7 Some Information about "Information Need"



We are now ready to answer the four questions posed in Section 7.1. 

The answers to these questions provide a convenient summary of the material
presented in the preceding sections.

• Why is the user seeking information? 

The user (or IS&R system observer) seeks information 
for the testing of hypotheses that he has about the 
data elements contained in the system. 

• What creates an information need? 

Information need is created by either a) hypothesis
disequilibrium, or b) active intellectual hypothesis 
testing. 

• How is the procedure of inquiry related to information need? 

The process of inquiry is the scientific method, a 
cyclic progression through observation and measure- 
ment, generalization and hypothesis testing. Problem 
solving behavior is effected through hypothesis testing.

• What is meant by information?

Information is defined as the reduction of uncertainty 
derived through partitioning of the event space. 

Information is also defined as acquisition of data 
suitable for hypothesis testing, e.g., data of value
in decision making. 



* That is, if A: System X contains data on melting points,
B: System X contains data on titanium compounds,
then D: System X contains data on melting points of titanium
compounds.
The following sections will deal with the role of information need
and hypothesis testing in human behavior, with a formal model of hypothesis
testing in information retrieval, and, finally, with a reconsideration of
the concept of relevance.

8. Problem-Solving and Decision-Making Behavior
8.1 Introduction

In the initial sections of this chapter, it was pointed out that a 
thorough understanding of the often used phrase "IS&R Systems Evaluation"
demanded a prior and careful consideration of the term evaluation. It was
postulated that evaluation was a judgmental, or correlational, relation
between retrieved data and the user's information goal. Although retrieved
data is quantifiable and is amenable to analysis, an informational goal is
recognized to be a qualitative, subjective concept. The argument was
presented that progress toward quantification of an informational goal could 
be achieved through a detailed analysis of the concept of the user’s 
information need. 

Information need is characterized as the impetus to a process of
inquiry involving measurement, observation, information acquisition, the
investigation cycle, hypothesis testing and decision making. Although the 
theoretical discussions that have been presented appear to be consistent, 
experimental testing of the derived hypotheses is required. The sections 
that immediately follow point toward such testing by means of an analysis 
of both models and behavioral investigations of human problem solving.

8.2 Problem Solving as Inquiry 

The presentation of the scientific method, as outlined in Section 7.4,
involved the description of a progressive shift from an individual's personal
experience to experience that can be generalized. This shift, or transition,
is the essence of Caws' three-step model, which terminates with the
acquisition of knowledge about an observer's environment. The cyclic process 
depicted in Figure 7.4.1, which is by itself the formal definition of a 
hypothesis, is also a description of the transition from individual experi- 
ence to shared knowledge. The feedback of observables from the testing of 
a hypothesis closes the investigative loop of the scientific method and 
implies that the scientific method, as a hypothesis testing cycle, is an 
open ended process. This posited hypothesis-testing cycle also provides a
convenient representation of problem-solving behavior.

Recall that observation and measurement are effected through the
perception of disequilibria in the observer's environment. It follows that
the hypothesis-testing cycle represents a progression from an observer's
initial response to disequilibrium, through a series of if-then relationships
(hypotheses) to a solution. This form of problem solving was recognized by
Dewey in his principle of continuum of enquiry: "The conclusions reached in
one inquiry become means, material and procedural, of carrying on further
inquiries" [47]. The idea is often overlooked, however, that the feedback
of observables itself may contribute to a structuring or patterning of the
inquiry. Such patterning helps to account for the existence of the relationship
between the object of the investigation and the manner of the inquiry.
This is recognized in Russell's "structural postulate" of Scientific
Inference [48] or in Whitehead's "grouping of occasions" [49]. This form
of inquiry and behavior is also identifiable in the vocabulary of some
psychologists, as a Gestalt.
In the Gestalt view behavior is a pattern or configuration of action
and the linkage of ideas associated with problem solving is progressive inquiry.
With such an emphasis on the organization of ideas, it is believed that
"psychological organization" tends to move toward the state of Prägnanz,
i.e., toward the good Gestalt.* This view of things is consistent with the
idea of a continuum of enquiry as an entropy-reducing operation (see
Section 7).

Trace theory, often associated with Gestalt formulations, posits a
stochastic representation of the subject's past in the characterization of
his present. Information gained from the test of a current hypothesis is 
assumed to be mediated by information obtained from previous hypotheses. 

This is one elaboration upon the idea that what is information (and, by 
implication, meaning) to a user is highly context sensitive. 

Achievement of a "good" Gestalt implies that problem solving behavior is 
directed toward an end situation which brings closure with it. The person's 
desire for completion of a task emphasizes the importance of his acting as
if inquiry had an attainable goal. According to Dewey: "The nature of the
problem fixes the end of thought, and the end controls the process of
thinking" [50]. Although I do not propose to view hypothesis-testing and
problem-solving behavior as Gestalten, this idea of things is useful in placing
emphasis on the organization and structure of inquiry, the utility of acquired
information, and the relevance of goal definition to a sense of closure when
an end state has been achieved. 




* Characterized by the laws of similarity, proximity, closure and 
continuity. 
8.3 Problem Solving Models

The first modern model of progressive inquiry in problem solving was
developed by Dewey. He believed that a problem should not be characterized
as a crisis or by the obtained solution, but, rather, as an inquiry sequence.
This view is reflected in Dewey's five problem-solving steps:



• a difficulty is sensed,

• the difficulty is located and defined, 

• possible solutions are suggested,

• consequences are considered, and 

• a solution is accepted. 



Notice that information need, hypothesis testing and decision making are
implicit in all five steps. 

Recently Guilford [51] has presented an extension of Dewey's conception
of problem-solving. This model, depicted in modified form in Figure 8.3.1, 
is a process model and exhibits the sequential ordering of data inputs, 
attention, cognition, production, evaluation and memory. The initial input 
and attention operation corresponds to Dewey’s first step; the first 
cognitive* operation corresponds to his second step; the production of a 
tentative answer corresponds to the third step; the second cognitive 
operation is analogous to the fourth step; and, the final production may be
equated with Dewey's fifth step. It is important to realize that the final
production (and adopted solution) may only be achieved after several
cognitive/productive iterations.

* Guilford [52] defines cognition as awareness, immediate discovery or 
rediscovery, or recognition of information in various forms.

Figure 8.3.1: A Problem-Solving Model.

The initial attention operation (in conjunction with the evaluation/
memory operation) corresponds to what we have chosen to call the hypothesis 
disequilibrium, while the several cognitive operations and the evaluation 
operation correspond to active hypothesis testing. If we conceptually 
separate these two forms of hypothesis testing, and remember that each one 
is tied to information need, Guilford's model can be reduced to the three 
stage model depicted in Figure 8.3.2. The recursive nature of this model, 
and its application to the analysis of Information Storage and Retrieval 
problem solving, will be considered later in this chapter. Attention is now
directed to the analysis of hypothesis-testing behavior. 

Investigation of problem-solving behavior is difficult because the 
central postulates remain, as yet, untried. Bourne [53] provides us with a 
hint of this problem: 

We can only infer that a decision-making process exists 
in problem-solving - [we] search for something only 
rumored to exist, but so indescribable that we cannot 
even tell when [that something] occurs. 

One of the central postulates of the hypothesis or, as they are sometimes
called, process theories is that in a problematic situation, a person (subject)
entertains at least one hypothesis. Thus the stimuli that the subject
receives (be they controlled or random environmental inputs) provide a test
of the hypothesis(es) under consideration. Observed data, following
validation with respect to the hypothesis, lead to acceptance, rejection,
or revision of the hypothesis.

It should be clear that active hypothesis testing, especially as depicted 
in Figure 8.3.2, involves input both from the exemplar (stimulus) and from 
the subject's environment. This input provides the data, subsequently
information, recognized by the problem solver. Hypothesis-testing behavior
thus defined is an attempt to reduce disorder that is sensed in the problem, 
through a reduction in the number of degrees of freedom associated with the 
problem. It is inferred that the information that is obtained, either as 
input or as the result of hypothesis testing, serves to partition the set 
of alternative hypotheses and, consequently, to provide further information 
about the problem. This partitioning results in the information gain 
discussed in Section 7.5. 
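
The partitioning of alternative hypotheses by successive inputs can be sketched as follows. The three binary attributes, the single-attribute form of the hypotheses, and the function `apply_instance` are hypothetical choices made for illustration; the bit count again assumes that surviving hypotheses are equally likely, which the text does not claim.

```python
import math

# Hypothetical problem: each instance has three binary attributes, and each
# candidate hypothesis says "this attribute alone decides the class".
hypotheses = {"colour", "shape", "size"}


def consistent(hypothesis, instance, label):
    """A hypothesis survives if its attribute value matches the label."""
    return instance[hypothesis] == label


def apply_instance(candidates, instance, label):
    """Split candidates into survivors and rejects; report bits gained."""
    survivors = {h for h in candidates if consistent(h, instance, label)}
    if not survivors:
        raise ValueError("no hypothesis fits; a new one must be formulated")
    return survivors, math.log2(len(candidates) / len(survivors))


instances = [
    ({"colour": 1, "shape": 1, "size": 0}, 1),  # rejects "size"
    ({"colour": 0, "shape": 1, "size": 0}, 1),  # rejects "colour"
]
total_bits = 0.0
for instance, label in instances:
    hypotheses, bits = apply_instance(hypotheses, instance, label)
    total_bits += bits
```

Two iterations suffice here because each instance eliminates one attribute; a problem with more attributes, or with less discriminating instances, would require more passes before a single hypothesis remains.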

The number of hypotheses that are adopted (i.e., the number of
iterations through the model of Figure 8.3.2) is assumed to be a function of
the number of attributes associated with the problem. This view of problem
solving implies that the problem solver is able to identify correctly the 
pertinent attributes of the problem. In hypothesis testing, the subject's 
major freedom comes from the rich domain of hypotheses from which he can 
choose. However, the problem usually structures the order of occurrence of 
the instances (i.e., data inputs) encountered. As we shall see, efficient
problem solving demands not only a person's choice of hypotheses but his
complete identification of the attributes involved. Experimental data on 
attribute identification and information processing in problem solving will 
be useful for the characterization of IS&R retrieval operations. 

8.4 Attribute Identification and Problem-Solving Strategies
In the preceding sections, we have emphasized the importance of the
act of inquiry in a problem solving situation. Several models have been 
presented that define inquiry as a multi-step process. Common to all of the 
steps involved in these processes is the process of hypothesis testing. 

Thus, all data and information that the problem, solver directly encounters 
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are presumed to Involve either implicit or explicit hypothesis testing. 

As has been previously emphasized, both forms of hypothesis testing are 
dependent upon an information need which creates a demand for new data, a 
new hypothesis, and so on.

How the problem solver handles and processes newly encountered data 
depends on several factors. These include the previous data encountered, 
hypotheses that have been tested, and information that has been gained. 

This means that the interpretation of new instances is a function of previous 
conceptualizations. One aspect of interpretation is posited to be the 
learning of a rule that can be applied to the analysis of successive data 
inputs. The other aspect of interpretation is assumed to be an ability to 
identify the attributes, or variables, that characterize the problem at 
hand. The Bruner, Goodnow and Austin [54] definition of attribute is
adopted for this discussion: "an attribute is any discriminable feature
of an event that is susceptible to some discriminable variation from event
to event."

When a problem has been solved (or a concept attained, in the Bruner 
sense) we shall say that the subject has identified those attributes and 
rules which enable him to classify and act upon any future instances 
encountered which are pertinent to the problem situation. There are many 
ways in which a problem solver might go about obtaining a solution to his 
problem. We shall describe these various methods of solution as strategies, 
i.e., a sequence of decisions involving the acquisition and utilization of
information for the achievement of a well-defined goal. A strategy is
presumed to be adopted for several reasons: to minimize the number of
iterations through the model in Figure 8.3.2, to minimize the subject's
"information overload", and to minimize error in decision making.

Strategies may be evaluated by their efficiency in eliminating alternative 
hypotheses concerning the attributes which are pertinent to the solution 
of the problem. Within these constraints, the number of iterations required
to reach a solution is dependent on the number of attributes that must
be correctly identified. 

Bruner, et al. [55] have identified four basic problem-solving
strategies:

• Simultaneous scanning 

• Successive scanning 

• Conservative focusing 

• Focus gambling 

Briefly, simultaneous scanning may be defined as the simultaneous testing of
several hypotheses about attribute importance; successive scanning is the 
testing of a new hypothesis with each successive instance encountered; 
conservative focusing is the orderly testing of attributes, and involves the 
use of only one attribute as the independent variable for each hypothesis 
tested (this strategy is, on the average, an optimal strategy); focus 
gambling reduces to the adoption of a “favored hypothesis”, as described in 
Section 7.6.2. 
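
The conservative focusing strategy can be made concrete with a small sketch. The following Python fragment is an illustration only, not a reconstruction of the Bruner protocol: the function name `conservative_focus`, the `oracle` function standing in for the experimenter, and the particular attribute values are all assumptions introduced here. Starting from a known positive exemplar, the sketch changes one attribute at a time and retains only those attributes whose change turns the instance into a negative one.

```python
from typing import Callable, Dict, List, Tuple

Instance = Tuple[str, ...]


def conservative_focus(
    exemplar: Instance,
    values: List[List[str]],
    is_member: Callable[[Instance], bool],
) -> Dict[int, str]:
    """Return {attribute index: required value} for a conjunctive concept.

    Assumes `exemplar` is a positive instance and that `is_member`
    answers truthfully (both are assumptions of this sketch).
    """
    relevant: Dict[int, str] = {}
    for i, options in enumerate(values):
        # Vary attribute i only; hold every other attribute constant.
        alternative = next(v for v in options if v != exemplar[i])
        probe = exemplar[:i] + (alternative,) + exemplar[i + 1:]
        if not is_member(probe):
            # Changing attribute i destroyed membership, so it matters.
            relevant[i] = exemplar[i]
    return relevant


# Illustrative attribute order: borders, color, number, shape.
VALUES = [
    ["1", "2", "3"],
    ["red", "green", "black"],
    ["1", "2", "3"],
    ["square", "circle", "cross"],
]


def oracle(instance: Instance) -> bool:
    # Hypothetical concept chosen for the demonstration: red squares.
    return instance[1] == "red" and instance[3] == "square"


found = conservative_focus(("2", "red", "1", "square"), VALUES, oracle)
print(found)  # {1: 'red', 3: 'square'}
```

Under these assumptions the number of probes equals the number of attributes, which is one way to read the remark that the number of iterations depends on the number of attributes that must be correctly identified.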

In the following section we will consider the nature and importance of 
attributes in an Information Storage and Retrieval environment. Attention 
will be directed toward the identification of the problem-solving steps in
the retrieval/search interface as a prelude to the description of an
extension of the classical Bruner conjunctive-concept experiment.

8.5 Attributes and the Retrieval Interface

The retrieval interface of an IS&R system presents a unique problem 
solving situation. The searcher (system user) has as his predefined goal 
the extraction of a subset of the data elements in the system data base in 
order to satisfy an information need. The searcher brings, initially, only 
two capabilities to the task: 1) his own cognitive structure and reasoning
ability; 2) his perception of the search interface. These two
features are presumed to determine the particular behavior pattern that the
searcher will display. However, as Reitman [56] cautions, the observed
behavior (in terms of strategy and results) will be less than ideal:

In real conflict and cooperation problems, the conditions
for game theoretic solutions are rarely met. We know
neither the full set of alternatives, the states of the
world, nor our opponent's (e.g., the IS&R system's)
perceptions and evaluation of them. Bluff, deceit, and
efforts to influence and persuade are possible because
and only because this is so. If the relevant facts about
the world, the alternatives, and the payoffs were known,
there would be nothing to deceive or persuade about.

In such a situation, the controlling variable appears to be the precision
of measurement implicit in the user's hypothesis (cognitive) structure, i.e.,
the precision of measurement associated with the definition of the component
data elements. This precision is dependent on both previous experience and
on general knowledge about the subject area encompassing the information
need. Saracevic, in his relevance measurement studies, has shown that the
more desperate a user is for information, the more relevant everything
becomes for him [57]; i.e., the user has a low precision of measurement
associated with his data element definition. This type of observation places
emphasis on the value of the user's a priori subject knowledge (what the user
brings to the search interface); indeed: "The less we know, the more
everything becomes relevant, and the more we know the more stringent we
become in our judgment [58]."

In addition to a lack of subject knowledge (which is an expression of
his information need), the searcher has a less-than-perfect knowledge of
the system's index space. This means that a searcher may not understand
the system's operating characteristics in terms of the transmission decoding,
language and vocabulary variables. For example, and considering only
the language variable, a searcher may be able to find data elements by
reference to a specific transmission decoding element (e.g., a subject
heading), but he is not knowledgeable of the set of relations (i.e., the
search language) available for modification and direction of a search.

Although we have discussed some of the attributes essential to
efficient IS&R problem solving, all attributes can be grouped into four
broad classes:

• The elements of the index space

• The number of access points to the data

• The rules of the cross-reference language

• The range of data elements (and associated
documents) that are available

It is postulated that effective search requires that the user obtain, as
soon as possible, knowledge and mastery of these attributes.

The problem-solving steps involved in the search interface are 
assumed to be essentially those described in Figure 8.3.2. Satisfaction of 
the searcher's information need is termed the goal of the problem-solving
activity; his query, as presented to the system, is actually a hypothesis
about the nature of the data elements in the system. The data elements
that are retrieved serve both to test this hypothesis and to permit the
searcher to decide whether his goal has been achieved. The retrieval and
testing of data elements may serve as a basis for modification of the user's
hypothesis structure. This modification may be inferred by an observer
from a new query by the searcher; it may also be inferred that he has
acquired a different expectation of the types of data elements to be
retrieved. Saracevic has confirmed the existence of this form of behavior
[59]: "Items judged as non-relevant tend to remain as such; items judged
as relevant are subject to change following iterations with the system."

Search feedback is essential to the solution of any retrieval problem 
and to the satisfaction of a searcher's information need. Feedback enables
a user to obtain information about the system. This information takes the 
form of data that enable him to decide which system attributes and user 
hypotheses will be effective for the achievement of his goal. Frequently, 
information obtained will create, through a modification of the user’s 
hypothesis structure, an alteration of his information need. However, it is 
not clear just how effectively the searcher can process feedback information. 
Some evidence is needed for how efficiently human problem solvers can identify
and utilize attributes in the solution of a problem. A "relevant"
experimental investigation is described in the next section. 

8.6 Experimental Investigation of Attribute Processing
The subject in the Bruner Concept Attainment Experiment [60] is tested
for his ability to achieve a fixed conjunctive concept. A conjunctive concept
is defined as: "The joint presence of the appropriate value of several
attributes." The simplest case would be when the concept was composed of
only one attribute, e.g., apples, squares, etc. A variation of this basic
experiment could center about the achievement of a fixed disjunctive concept,
i.e., attribute(1) or attribute(2) or ....

As depicted in Figure 8.6.1, the subject is provided with an array of
instances (data elements characterized in terms of attributes and attribute
values) to be tested in order to attain a concept. With each instance
encountered or identified, the subject must decide whether the instance is
an example of the concept sought. A brief examination of Figure 8.6.1 will
show that the subject is presented an ordered array of 81 instances constructed
from four attributes (border, color, number, shape) each of which may
take on three different values.* After an initial exemplar of the concept
has been presented by the experimenter, the subject is instructed to use a
question-answer technique in an effort to discover the chosen concept.

It should be noted that in the Bruner studies the subject is given the 
array of instances as a problem-solving aid. Because the array is
systematically ordered by attribute values, the subject is more likely to 
perceive the set of attributes involved than when the instance array is not 
so well ordered. Finally, the subject is informed both of the definition of 
a conjunctive concept and of the procedural rules of the experiment. 
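
The size of the instance array follows directly from four attributes with three values each (3 x 3 x 3 x 3 = 81). As a minimal sketch, the array and a conjunctive-concept test can be written in Python as below. The dictionary layout and the names `build_array` and `matches` are assumptions made for illustration; the attribute values are those given in the footnote to this section, and the single-attribute concept "square" is the one used later in the extended experiment.

```python
from itertools import product

# Four attributes, three values each (values as listed in the footnote).
ATTRIBUTES = {
    "borders": (1, 2, 3),
    "color": ("red", "green", "black"),
    "number": (1, 2, 3),
    "shape": ("square", "circle", "cross"),
}


def build_array():
    """Enumerate every combination of attribute values as one instance."""
    names = list(ATTRIBUTES)
    return [dict(zip(names, combo)) for combo in product(*ATTRIBUTES.values())]


def matches(instance, concept):
    """Conjunctive concept: every listed attribute must have the given value."""
    return all(instance[name] == value for name, value in concept.items())


instances = build_array()
positives = [inst for inst in instances if matches(inst, {"shape": "square"})]
print(len(instances), len(positives))  # 81 27
```

A one-attribute concept therefore splits the array into 27 positive and 54 negative instances; each additional attribute in the conjunction cuts the positive set by a further factor of three.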

* 1, 2, or 3 borders; red, green, black; 1, 2, or 3 objects; square,
circle, cross.

We have previously noted that the studies of Bruner, et al., have
identified four general problem-solving strategies. The usually optimal
strategy (conservative focusing) relies on the subject’s systematic variation
of one attribute while holding the remaining three constant. In such a case,
the information received for testing successive hypotheses about which
attribute is the concept is maximal since it permits a minimal partitioning
of the space of instances. Four general results have been obtained
from these information processing studies [61]; they are listed below.

• Strategies can be described both in terms of goal
and by the problem-solving steps.

• In the absence of new information the subject will 
fall back to the testing of previously useful cues. 

• Subjects may fail to use information arising out of 
negative instances or indirect tests. 

• Subjects frequently fail to assimilate as much 
information as is potentially available from the 
testing of an instance. 

Unfortunately, the real world is seldom structured in the way problem
situations are structured in the Bruner experiment. Very seldom does one 
possess complete knowledge of the collection of attributes involved in a 
problem-solving task with which he is confronted. Thus, an initial step in 
a systematic solution (and prior to the adoption of an ideal strategy) of a 
problem is the identification of the set of variables or attributes involved 
in the problem. It is believed that whether the problem is the identification 
of a concept, the location of a book in a library, or the retrieval of data
from an IS&R system, the problem-solving steps are essentially the same.
Thus, one of my goals has been to see if the general conclusions about
problem-solving behavior of subjects in the Bruner studies (see above) arise
when a subject is supplied an unstructured task, characterized by the
attainment of a conjunctive concept as goal, together with a randomized
Bruner instance array (with each instance separated from the others) and a
minimum of procedural information. Given an unstructured task (which is not
dissimilar in nature to that of an IS&R system interactive interface) I
want to know whether a subject is able to 1) perceive the attributes 
involved, 2) attain the concept, 3) utilize a recognizable strategy, and 
4) perceive the information content of the "informative displays" supplied 
to him during the course of the (attempted) task achievement. 

8.6.1 The Extended Bruner Experiment

I have previously emphasized that, in everyday life, one rarely has
complete knowledge of the attributes associated with a given problem. 
Consequently, to solve a problem, one must first identify its pertinent 
attributes. Early and accurate attribute identification is prerequisite to 
obtaining an efficient solution to any IS&R problem. The purpose of the 
experiment described below was to determine if potential problem solvers (Ss)
could identify the attributes associated with a problem and use them to 
achieve problem solution. This experiment is based upon several of the 
central assumptions of Hypothesis Theory [62]. Six governing assumptions are:

• Ease of problem solution is directly related to the degree
of structure (i.e., number of attributes, lack of ambiguity,
and clarity of attribute presentation) associated with the
problem.

• In a problem-solving setting, a subject's behavior results
from his hypothesis testing when concept attainment is
requisite to problem solution.

• A subject (S) enters any situation with some preconceived
hypotheses (Hs) about what is to occur and about what
behavior he is expected to exhibit.

• S has an initial set of Hs from which he samples until he
selects the correct H (or else gives up).

• S modifies his initial hypothesis with increasing experience.

• If the situation is unstructured (i.e., the problem is
presented in an ambiguous or unclear fashion) S will attempt
to impose a working structure (i.e., impute logical relationships
among elements considered to be pertinent attributes)
based on his own experience.

The reader will recall that in the Bruner Conjunctive Concept Game,
S was provided an ordered array (see Figure 8.6.1) of instances to assist
him in solving the problem. The S was given the definition of a conjunctive
concept and ample procedural rules for the conduct of the game. The play
was started with selection, by E, of an exemplar of the concept which was
then presented to S. S would then select (following an initial hypothesis
concerning the nature of the concept) an instance from the array and ask E
if it contained the concept; E would then answer YES or NO. In the extended
Bruner experiment the subject was presented with minimal structure, sparse
procedural definitions, and little information as to how he was to behave.
Instead of presenting S with an ordered array, each of the 81 instances was
placed on a separate card and the resulting deck of cards was randomized.
Thus, in the extended Bruner experiment, the exemplar and the array were
presented, respectively, as a card and as a randomized deck.

8.6.1.1 The Instructions and Conduct of the Experiment

In this experiment the S’s only introduction to the problem consisted
of the following statement read by E:

• I am interested in how people solve problems. I am particularly 
interested in the processes people use. In fact, I am more 
concerned with what you do in trying to find the answer, than 
with whether you are able to find it. I have manufactured a 
problem for you and assume you have as your goal the solution of 
this problem. 

E then reads and demonstrates the following instructions:

• Here is a card that has some objects on it [E places the exemplar
card before S]. I have an object in mind [the concept was "square"
throughout the experiment]. Your job is to identify what I
have in mind.

• Here is a deck of cards [E places the deck of cards before S].
Each card in this deck is similar to the card I have shown you.
You may use the deck of cards to help you identify and name what
I have in mind.

• I will record what you do in solving the problem, but will not
help you in any way. Over here, however [E points to his
assistant], I have installed a helping machine. The helping
machine will answer any question you have with a YES card, a
NO card or a BLANK card.*

Finally, E places the instructions beside S and begins to record data. The 
recorded data included: the question asked by S; the helping machine's
response; the number of times S read the instructions; the number of times S
sequentially scanned the deck of cards. The experiment was terminated when
(1) the S realized he had obtained the solution, (2) the subject announced
he had quit attempting to solve the problem, or (3) fifteen minutes of
play had elapsed.
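
The response rule of the helping machine can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how the assistant actually worked: it assumes each question has already been reduced by hand to a list of (attribute, value) pairs, and it assumes that YES was given exactly when the named value was the concept ("square"). The names `helping_machine`, `CONCEPT` and `DECK_VALUES` are hypothetical.

```python
from typing import List, Tuple

# The concept was "square" throughout the experiment.
CONCEPT = ("shape", "square")

# Attribute values present in the deck.
DECK_VALUES = {
    "borders": {"1", "2", "3"},
    "color": {"red", "green", "black"},
    "number": {"1", "2", "3"},
    "shape": {"square", "circle", "cross"},
}


def helping_machine(question_parts: List[Tuple[str, str]]) -> str:
    """Answer a question already parsed into (attribute, value) pairs.

    Assumption of this sketch: a non-compound question names exactly one
    attribute value of the deck; anything else receives a BLANK card.
    """
    if len(question_parts) != 1:
        return "BLANK"  # compound question, or no attribute named at all
    attribute, value = question_parts[0]
    if value not in DECK_VALUES.get(attribute, set()):
        return "BLANK"  # not a question about an attribute of the deck
    return "YES" if (attribute, value) == CONCEPT else "NO"


print(helping_machine([("shape", "square")]))                    # YES
print(helping_machine([("color", "red")]))                       # NO
print(helping_machine([("color", "red"), ("shape", "square")]))  # BLANK
```

Under this reading, a BLANK card carries no information about the concept itself, only about the form of question the machine will accept.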

8.6.1.2 Results and Discussion

Experimental results obtained from 48 subjects (a mixture of senior
high school, undergraduate and graduate university students) are presented in
Table 8.6.1.2.1. Column A shows the overall average number of questions
asked by the Ss, the average number of yes, no and blank answers, the average
number of times S read the instructions, and the average number of times
that the deck was sequentially scanned by S. Twenty-one percent of the Ss
actually obtained (realized that they had) the solution to the problem. It
should be noted that 60% of the Ss who obtained the solution utilized a
focus gambling strategy.

* Any non-compound question about an attribute of the deck was answered
YES or NO; any other question was answered by a BLANK card.

                                      COLUMN A            COLUMN B
                                      Sample size: 48 Ss  14/48 = 29% had a theoretical solution
                                      (average values)    (average values up to theoretical solution)

Number of questions:                  20                  20.5
Number of yes's:                      3                   3
Number of no's:                       5                   7
Number of blanks:                     12                  10.5
Number of times instructions read:    3                   2
Number of times deck sequentially
scanned:                              2                   1.6
Number of solutions:                  10/48 = 21%         4/14 = 28%

Strategy:
  Focus gambling:       6/10
  Other (unidentified): 4/10

No theoretical solution achieved: 34/48
  Average number of attributes eliminated: 0.6
  22 Ss eliminated 0 attributes
   5 Ss eliminated 1 attribute
   5 Ss eliminated 2 attributes
   2 Ss eliminated 3 attributes

Table 8.6.1.2.1: Results of the Extended Bruner Experiment.


Column B presents data on those Ss (14/48) who exhibited a theoretical
solution. Theoretical solution is an analytical construct that indicates
a point in the S’s protocol when, by means of the information potentially
available from the questions and answers, he has eliminated all but the
correct attribute. The word "theoretical" is used because S did not realize
that the solution could be obtained from the available information. It is
interesting to note that the average values up to the point of theoretical
solution are very close to the average values of the entire experimental
group (column A). In other words, those Ss exhibiting a theoretical solution
had an above-average total number of questions and answers. Any information
acquisition activities beyond the theoretical solution represent redundant
data processing on the part of the S. Furthermore, only twenty-eight percent
of those Ss exhibiting a theoretical solution actually obtained the correct
solution. Finally, the 34 Ss that did not exhibit a theoretical solution,
on the average, eliminated less than one attribute.
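
The figure of "less than one attribute" can be checked against the counts reported in Table 8.6.1.2.1. The short Python calculation below is only an arithmetic check on the published counts; the variable names are introduced here for illustration.

```python
# Ss without a theoretical solution, grouped by attributes eliminated
# (counts as reported in Table 8.6.1.2.1).
eliminated = {0: 22, 1: 5, 2: 5, 3: 2}

n_subjects = sum(eliminated.values())
total_eliminated = sum(k * count for k, count in eliminated.items())
mean_eliminated = total_eliminated / n_subjects
print(n_subjects, total_eliminated, round(mean_eliminated, 1))  # 34 21 0.6

# Proportions quoted in the text, as percentages of the 48 Ss.
solved_pct = 100 * 10 / 48
theoretical_pct = 100 * 14 / 48
print(round(solved_pct), round(theoretical_pct))  # 21 29
```

The 34 Ss eliminated 21 attributes in total, or about 0.6 attributes each, in agreement with the table. The remaining proportion, 4/14, is about 28.6%, which the table reports as 28%.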

Although some of the Ss obtained the solution, the majority of the Ss 
failed to properly identify the necessary attributes. Following debriefing
sessions, it became obvious that although many Ss could name the attributes
of the deck, they failed to understand the relationships between what they
observed in the deck and the solution of the problem. The failure to relate
what they observed in the deck to the instructions given by E for problem
solution partially accounts for the fact that so few subjects actually
attained the solution. It is concluded that, in such an unstructured
problem-solving situation where the stimuli appear ambiguous to the S, the
concept of strategy has little meaning. With an incomplete understanding
of the nature of the attributes involved in the problem, a subject is hard
pressed to adopt an "optimal" strategy for the methodical testing and
elimination of attributes. Although it was clear that Ss were repeatedly
testing hypotheses from some predefined set, their apparent lack of problem 
structure permitted only a focus gambling strategy. 

A subject's failure to identify the pertinent attributes, adopt a useful
strategy and attain the concept suggests that in a relatively unstructured task
Ss, with few exceptions, were unable to extract all of the potential
information contained in the informative displays (questions and answers) of
the play. The fact that only four of fourteen Ss who displayed theoretical
solutions achieved actual solutions supports this observation. Furthermore,
for many Ss, the successive modification of hypotheses seemed to be confused
by a recurring uncertainty about the procedural rules of the interaction. This
was reflected in the proportionally greater number of blank responses than
of yes-no responses given. Frequently, Ss would fall back to previously
confirmed hypotheses (by asking the same questions) in an apparent attempt
to validate the consistency of the helping machine, or to explore the
conditions of a yes or no response when confronted with a perplexing and
non-useful sequence of blank responses.

Although this investigation was not designed to represent completely an
IS&R interface, it is believed that the experiment does reflect and highlight
some of the important features of information retrieval problem-solving
interaction. The results of this experiment support the hypothesis that
effective interaction, information acquisition and processing, and subsequent
goal achievement are related to the amount of problem-solving structure and
aids available to the subject or user. Some subjects exhibit problem-solving
capabilities even in the most unstructured and ambiguous situations; however,
good systems design must provide the necessary structure to stimulate
"natural" problem-solving capabilities.

9. A Hypothesis Structure Model 
9.1 Introduction 

Most studies in man/machine interaction assume, as a starting point,
that the use of a computer will enhance human creativity by providing significant
insights into the solution of the problem under consideration. This
form of interaction is often referred to, in biological metaphor, as a
symbiosis. With such an optimistic view of man/computer interaction, one
would expect the IS&R search interface to reflect these desirable qualities.
Unfortunately, experience provides little data to support this expectation.

Sackman's studies of man/computer problem solving [63] have revealed
that only 10 percent of the total problem-solving time is spent at the system 
interface. Users come to the system armed with preconceived assumptions 
about the disposition of the system’s data, and spend just enough time with 
the system to test their assumptions. New hypotheses are not formulated
on-line, but are developed during subsequent periods of isolation from the
system. From Sackman’s evidence and that obtained in the extended Bruner
experiment, it appears that the time spent by a user in direct on-line 
interaction with a system is best represented by the testing and feedback 
operations of the problem-solving process discussed earlier. 






A considerable portion of this chapter has been devoted to a discussion 
of the problem-solving process, and I have assumed that this process arises 
as a consequence of a user's information need. Information need was defined,
in Section 7, as the result of either hypothesis disequilibrium or active
hypothesis testing. The consideration of problem-solving behavior, presented
in Section 8, has emphasized that, while frequently adopting a strategy,
a problem solver is apt to derive considerably less than maximal benefit
from the information available to him, even when he adopts an identifiable
strategy. This failure may be attributed largely to the subject's lack of
perception of the full range of alternatives involved in the problem.
Consequently, the design of effective systems must take into account these
human behavior patterns. Similarly, the extent to which such behavioral
considerations are accounted for in IS&R systems design provides a basis for
evaluation of such systems. As will be shown, the necessar re value
is defined through a formalization of the user's hypothesis structure. This
formalization is found in a reconsideration of the hypothesis-testing model
briefly presented in Section 8 of this chapter.

9.2 Hypothesis Structure Model

One of the major tenets of this chapter is that, despite the apparent
complexity of most IS&R systems, the problem of system performance evaluation
reduces to the problem of formulating an adequate description of a single
interface: a man/machine decision-making interface. We shall characterize
this user/IS&R-system interface as the communication between two
cognitive structures. The reader is referred to Figure 9.2.1 for a
conceptualization of this interface (after Carbonell [64]). In this figure,
the user's data structure models the goal (the satisfaction of his
information need) that he is attempting to achieve, while the IS&R data
structure embodies all of the data in the system that may be of value in
goal achievement. Figure 9.2.1 also suggests that the user must successively
evaluate the system and its outputs, exercise control over the system
and, finally, act as a decision-maker concerning the relevance of the data
to his goal.

Figure 9.2.1: The Man/Machine Interface.

Following the discussion in Section 7 of this Chapter, I assume that 
hypothesis testing, data-element acquisition and data-element ordering are 
the essential aspects of cognition in respect to use of an IS&R system. In 
this case, a cognitive data structure may be updated through a process of 
active hypothesis testing. Two distinct forms of hypothesis testing have 
been identified: 1) that concerning the existence of a specific data-element
value of a node in the cognitive structure, and 2) that concerning
the occurrence of data elements conforming to a particular relational pattern.
The need to acquire data for the testing of such hypotheses creates what I
have chosen to call information need.

Clearly defined use of an IS&R data base demands a prior and well-defined 
information need. Examples of an information need are: "What is the boiling
point of water?" or "What is the melting point of Titanium Oxide at various
pressures?" Presumably, a user comes to an IS&R system with the intent of
discovering data that will satisfy some such information need. It is
postulated that the user's interaction with the IS&R system will be guided
by a hypothesis he has formulated about how the data are to be retrieved.
A user's hypothesis about a system's data is presumed to be dependent upon
his information need, which in turn is assumed to be dependent on the user's
cognitive structure.

An initial hypothesis about data to be retrieved from an IS&R system 
data base could be phrased as: "There are documents in the system dealing
with the melting properties of Titanium Oxide."* Data elements retrieved
by the searcher will serve either to support or refute the hypothesis under
consideration. It is possible, however, that the retrieved data are
insufficient to test the initial hypothesis, and that a new, more appropriate
(i.e., likely to be supported) hypothesis will need to be formulated. More
appropriately restrictive hypotheses might be: "There are documents dealing
with the properties of Titanium Oxide", "There are documents dealing
with Titanium Oxide." If the retrieved data do serve to test the hypothesis, 
and if the outcome is positive, then a decision will have to be made as to 
whether the hypothesis satisfies the information need. If the information 
need has not yet been satisfied, then a new hypothesis must be formulated 
and the data acquisition process resumed. 
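
The succession of hypotheses in the Titanium Oxide example can be sketched as a simple loop. The Python fragment below is an illustration only: the three-document collection, its index terms, and the names `retrieve` and `search` are invented for the purpose, and the sketch makes the simplifying assumption that an empty retrieval is what renders the data "insufficient" to test a hypothesis.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical collection: document identifier -> index terms.
DOCUMENTS: Dict[str, Set[str]] = {
    "doc1": {"titanium oxide", "properties"},
    "doc2": {"titanium oxide", "crystal structure"},
    "doc3": {"water", "boiling point"},
}


def retrieve(query: Set[str]) -> List[str]:
    """Return the documents indexed under every term of the query."""
    return sorted(doc for doc, terms in DOCUMENTS.items() if query <= terms)


def search(hypotheses: List[Set[str]]) -> Tuple[Set[str], List[str]]:
    """Try each hypothesis in turn until one retrieves some data."""
    for hypothesis in hypotheses:
        hits = retrieve(hypothesis)
        if hits:
            return hypothesis, hits
    return set(), []


# The three hypotheses of the example, most specific first.
hypotheses = [
    {"titanium oxide", "properties", "melting"},
    {"titanium oxide", "properties"},
    {"titanium oxide"},
]
supported, hits = search(hypotheses)
print(sorted(supported), hits)  # ['properties', 'titanium oxide'] ['doc1']
```

In this toy collection the initial hypothesis retrieves nothing, so the second hypothesis is formulated and is supported by one document. Whether that document satisfies the information need is a separate decision, which the sketch leaves to the searcher.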

* This example illustrates the use of simultaneous scanning and, as the
discussion subsequently shows, such a strategy often leads to a more
extended interaction with the system than would a conservative focusing
strategy. Strategy is further considered in Section 9.3.

The cyclic nature of this process is further formalized in Figure 9.2.2.
The information need, as determined by the user’s own cognitive structure,
is assumed to define a hypothesis structure which may be interpreted as a
representation of all data which have been or are expected to be retrieved.
Data elements from this hypothesis structure (H.S.), in conjunction with the
language parameter of the system’s index space, define the formulation
of the query. The retrieved data are sampled until a sequence of data
elements acceptable in terms of the H.S. is obtained. If the data elements 
of this sequence are identical with the data elements demanded by the 
information need, then the sampling process is temporarily suspended until 
a new information need arises. However, if the retrieved data elements are 
insufficient to satisfy the H.S., then it is assumed that the H.S. and the 
attendant query itself must be modified (e.g., through the incorporation of
new data elements into the set of acceptable strings defined by the H.S.).
The reader, at this point, should realize that the steps enclosed within the
dotted line of Figure 9.2.2 are a re-statement of the process-of-inquiry
model previously depicted in Figure 8.3.2. I shall now represent this
process in terms of a generalized machine that processes data elements as
inputs.

The hypothesis formulation model is viewed as a sequence of two finite
deterministic Rabin-Scott automata (see Figure 9.2.3). I assume the existence
of a finite input alphabet $\Sigma$ of data elements which correspond to the
transmission decoding elements of a system's index space. A string input
to the user's hypothesis-structure automaton is, then, a finite sequence of
data elements from $\Sigma$: $d_1 d_2 d_3 \ldots d_n$. The hypothesis structure
may therefore be described as a finite automaton over $\Sigma$,

$$\mathrm{H.S.} = (S, \mathrm{TRANS}, s_0, F)$$

where $S$ is a finite set of states of H.S., TRANS (a binary transition matrix)
is a mapping of $S \times \Sigma$ into $S$, and $s_0$ and $F$ are the initial
and final states, respectively.

In this model the states represent the data elements that the user believes
to exist in the system. The binary transition matrix, TRANS, with entries
$\alpha_{ij} \in \mathrm{TRANS}$, $\alpha_{ij} = 0, 1$, indicates by means of
a 1 that data element $d_i$ can be followed by data element $d_j$. This serves
to define the set of possible strings of data elements
$d_{i_1}, d_{i_2}, \ldots, d_{i_r}$ that can potentially be associated with a
given hypothesis. A cut-off value, $\lambda$, is applied to the sampling, so
that if the sample size (the number of data elements observed from the
retrieval operation) exceeds the cut-off $\lambda$, a new hypothesis structure
is selected involving a new transition matrix TRANS' and a new $s_0$ and a new
set $F$. This process is continued until a string

$$d_1 d_2 d_3 \ldots d_f, \qquad f \in F,\ f \le \lambda$$

is obtained. A string accepted by the hypothesis structure is then input to
the decision automaton, D.

Figure 9.2.3: The Hypothesis Structure Automaton.
($d_i$ = $i$th data element; H.S. = hypothesis structure; D = decision
automaton; 1 = stop state)

The string input to the decision automaton is one of the set of possible
strings $d_{i_1}, d_{i_2}, \ldots, d_{i_r}$ defined by the hypothesis-structure
automaton. Thus, the decision automaton is a finite automaton defined over
TRANS (or TRANS'),

$$D = (S^*, \mathrm{TRANS}^*, s_0^*, F^*)$$

where $S^*$ is a set of states representing the data elements of the
information need, TRANS* defines the exact information need/data element
sequence, and $s_0^*$ and $F^*$ (a singleton set) represent the initial and
final data elements of the information need.

If the information need involves just a single datum then
$S^* = \{s_0^*\} = F^*$. However, if the information need is a pattern of data
elements (exhibiting an ordering relationship) then $S^*$ and TRANS* define a
specific string of data elements. If, upon completion of the scan of the input
tape, the decision automaton is not in its final state, then the operation of
the H.S. automaton is re-initiated; if the final state is reached, then the
decision automaton outputs a "1" to terminate the data gathering process.
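The two-automaton cycle just described can be sketched in a few lines of Python. This is only an illustrative reading of the model, not the author's formal construction: the class name `HypothesisStructure`, the function `decision`, the treatment of unrecognized elements as skipped null states, and the rule that a disallowed transition rejects the string are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class HypothesisStructure:
    """Sketch of H.S. = (S, TRANS, s0, F) with a sampling cut-off lam."""
    trans: dict   # trans[(d_i, d_j)] == 1 means d_i may be followed by d_j
    final: set    # F, the final states
    lam: int      # cut-off on the number of data elements sampled

    def known(self):
        # S: every data element named in TRANS or in F
        return {d for pair in self.trans for d in pair} | set(self.final)

    def accept(self, sample):
        """Return the accepted string of data elements, or None."""
        known = self.known()
        path = []
        for count, d in enumerate(sample, start=1):
            if count > self.lam:
                return None            # sample size exceeds the cut-off
            if d not in known:
                continue               # null state: element is ignored
            if path and self.trans.get((path[-1], d), 0) != 1:
                return None            # transition not allowed (assumed rule)
            path.append(d)             # first recognized element plays s0
            if d in self.final:
                return path
        return None


def decision(accepted, need):
    """Sketch of D: output 1 only if the string is exactly the needed one."""
    state = 0
    for d in accepted:
        if state < len(need) and d == need[state]:
            state += 1
        else:
            return 0
    return 1 if state == len(need) else 0


hs = HypothesisStructure(trans={("a", "b"): 1, ("b", "c"): 1},
                         final={"c"}, lam=5)
accepted = hs.accept(["a", "x", "b", "c"])   # "x" is a null state
print(accepted, decision(accepted, ["a", "b", "c"]))
```

When `accept` returns `None`, the model calls for a new hypothesis structure (a new TRANS, initial state and final set); when `decision` returns 0, the H.S. automaton is re-initiated.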

This IS&R interaction and decision-making model assumes that the data 
inputs are stochastically invariant. Thus, the user's information about the
IS&R data structure changes with each cycle in the model, including the two
feedback loops (H.S. into itself and D to H.S.). In this model the obvious
user-controlled variables are $\lambda$, TRANS, $s_0$ and $F$. I assume that the
definition of D is controlled by the specification of the information need.

It is interesting to note that the subject’s probability estimations of the 
transitions in the TRANS matrix are conveniently modeled by the type of 
Bayesian estimator employed in the information acquisition discussion of 
Section 7.5 of this Chapter. While such considerations tend to idealize
human behavior, Peterson and Beech [65] and Schum [66] have shown that persons
tend to adopt the conservative strategy (optimal) in their inferential
probability estimates, especially when faced with an increase in the amount
of data to be evaluated. This is consistent with the observations, made in
Section 8, that subjects may not readily accept the value of negative
evidence. The null hypothesis is always present; indeed, it is reflected in
the choice of $\lambda$. Generally, lack of user confidence in the current
hypothesis structure is reflected by the assignment of a small value to the
cut-off $\lambda$. This serves to place additional weight on the probability of
a different hypothesis structure from the set of alternative hypothesis
structures prior to subsequent interaction with the system.
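As a rough illustration of how a subject's estimate of a single TRANS entry might be updated from observed data, the following sketch uses a Beta-Bernoulli posterior mean. The estimator of Section 7.5 is not reproduced here, so this particular form, the function name `posterior_mean` and the uniform prior are assumptions chosen only to make the idea concrete.

```python
def posterior_mean(followed, observed, prior_yes=1.0, prior_no=1.0):
    """Estimate the probability that data element d_i is followed by d_j.

    followed: number of observed occurrences of d_i that were followed by d_j
    observed: total number of observed occurrences of d_i
    prior_yes, prior_no: Beta prior pseudo-counts (uniform prior by default)
    """
    if followed < 0 or observed < followed:
        raise ValueError("need 0 <= followed <= observed")
    return (prior_yes + followed) / (prior_yes + prior_no + observed)


# With no data the estimate is the prior mean; it moves as evidence accumulates.
print(posterior_mean(0, 0))
print(posterior_mean(3, 4))
```

Larger prior pseudo-counts make the estimate move more slowly as data arrive, which is one simple way to picture a conservative estimator.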

This hypothesis-testing, decision-making model appears to provide the 
necessary quantification for the testing and evaluation of systems performance.
However, before a final consideration is made of the concept of relevance,
let us examine an example of the hypothesis-testing cycle that has been
developed. 

9.3 An Example of the Hypothesis Structure

The theoretical notion of a hypothesis structure, which I have developed
in the previous section, is susceptible of exemplification, as I shall
show in this section. The illustration which is offered is derived from a
sample query given in the CAS Preparation of Search Profiles manual [67].

The query, as originally formulated, is shown in Figure 9.3.1, using set 
notation rather than the form employed by CAS in constructing profiles 
(queries). Several inferences regarding the user's knowledge of the retrieval 
system and of its data base can be drawn by inspection of the query in the 
form shown in Figure 9.3.1. For instance, the use of truncation, as for
TOXIC* (a search term which would "match" TOXIC, TOXICITY, TOXICOLOGICAL,
etc.), to produce generalized search terms* indicates that the user
hypothesizes that the CAS search system is capable of handling such
specifications.
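The effect of a truncated search term can be illustrated with a short sketch. The function `matches` is hypothetical and is not taken from the CAS manual; it simply treats a trailing asterisk as suffix truncation and a leading asterisk as prefix truncation.

```python
def matches(term, word):
    """Return True if `word` satisfies the (possibly truncated) search term."""
    t, w = term.upper(), word.upper()
    if t.endswith("*"):
        return w.startswith(t[:-1])   # suffix truncation: TOXIC*
    if t.startswith("*"):
        return w.endswith(t[1:])      # prefix truncation: *CYCLAMATE
    return w == t                     # untruncated term: exact match


for word in ["toxic", "toxicity", "toxicological", "toxin"]:
    print(word, matches("TOXIC*", word))
```

Under this reading TOXIC* accepts TOXIC, TOXICITY and TOXICOLOGICAL but not TOXIN.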

One may argue similarly that the use of logic (AND and OR at least) as
well as the use of alphameric** characters to describe the search terms
indicates a good understanding of the attributes of the system on the user's
part, or else high expectations on the part of the user as to the capabilities
of the CAS search system. Although one is tempted to explore further these
considerations, my immediate purpose will be served if our attention is
confined to inferences about the user's hypotheses concerning the contents
of the data base upon which the search system operates.

* For details see, for example, Colombo and Rush [68].

** A term signifying alphabetic (Roman) characters, punctuation and the
Arabic numerals.

(pharmacol* or toxic* or poison or analy*) and (artificial sweeten* or saccharin*)

Figure 9.3.1: The Original Query, as taken from the
CAS Preparation of Search Profiles.

The user is assumed to have come to the system with an information need
expressed as a composite of three hypotheses:

The system contains documents dealing with the pharmacology,
toxicology and/or analysis of artificial sweeteners.

Taking this expression, augmented by the terms SACCHARIN and POISON, and
employing truncation, we obtain seven data elements which constitute the
initial query. These seven data elements form the names of the rows and
columns of the user's TRANS matrix, as illustrated in Figure 9.3.2. For
convenience of display, the TRANS matrices used throughout this example
have been limited to two dimensions. In the generalized case, however,
multidimensional arrays would have to be employed to account for the many
possible permutations of strings of data elements which would potentially
satisfy the conditions specified in the query. The TRANS matrix is entered
by means of the first recognized data element, $s_0$, in the search output.
A 1 in a cell of TRANS indicates a term in the $i$th row of TRANS may be
followed by a term in the $j$th column of TRANS. The symbol $1_f$ signifies
that the transition (from $i$ to $j$) results in the attainment of a final
state (i.e., the attainment of an $f \in F$). This means that the document
containing the sequence of data elements leading to a final state is accepted
by the hypothesis structure (H.S.) automaton.

Figure 9.3.2: The Initial Hypothesis Structure (TRANS matrix).
($s_0$ = initial term; $f$ = final state; * denotes truncation of suffix or
prefix; $\lambda$ = 5)

Examples of strings of data elements which would be accepted by the H.S.
automaton (TRANS matrix) of Figure 9.3.2 are:

• PHARMACOL* ARTIFICIAL SWEETEN*

• ANALY* SACCHARIN*

• SWEETEN*

• SACCHARIN*

Such sequences of data elements are not intended to carry the implication
that any one specific logical or functional relationship exists between these
data elements as they occur in a source document. Rather, a particular 
sequence merely represents the user's expected order of occurrence of the 
data elements in the document. 

With a sampling cut-off value of five ($\lambda$ = 5) for the initial
query, as represented in Figure 9.3.2, the document whose title is given
below was retrieved (and was accepted by the H.S. automaton).

# 200 Analytical methods of artificial sweeteners.
Determination of sodium cyclamate.

The string of data elements

ANALYTICAL - ARTIFICIAL - SWEETENERS

supports the hypothesis that there are documents in the data base which
contain data on the analysis of artificial sweeteners. However, the other
two hypotheses of the original composite hypothesis remain unsupported. It
should be noted that the sequence of data elements presented to the H.S.
automaton (with $\lambda$ = 5), namely

ANALYTICAL METHODS OF ARTIFICIAL SWEETENERS

contains two non-query data elements,

• METHODS

• OF

These data elements correspond to null states in the TRANS array. The
remaining three non-query data elements in the retrieved document
(DETERMINATION, SODIUM, CYCLAMATE) are not considered by the H.S. automaton
because $\lambda$ = 5; however, they may serve as meta-information in the
formulation of a subsequent query.
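The scan of document # 200 can be traced with a small sketch. Because the TRANS matrix of Figure 9.3.2 is not reproduced in the text, the allowed transitions and final states below are assumptions inferred from the example strings quoted in this section; the names `TERMS`, `ALLOWED`, `FINAL`, `term_for` and `scan` are hypothetical.

```python
# The seven data elements of the initial query (truncated terms end in "*").
TERMS = ["PHARMACOL*", "TOXIC*", "POISON", "ANALY*",
         "ARTIFICIAL", "SWEETEN*", "SACCHARIN*"]

# Assumed 1-entries of TRANS and assumed final states (not from Figure 9.3.2).
ALLOWED = {("PHARMACOL*", "ARTIFICIAL"), ("TOXIC*", "ARTIFICIAL"),
           ("ANALY*", "ARTIFICIAL"), ("ARTIFICIAL", "SWEETEN*"),
           ("ANALY*", "SACCHARIN*")}
FINAL = {"SWEETEN*", "SACCHARIN*"}


def term_for(word):
    """Map a title word to the query data element it matches, if any."""
    w = word.upper().strip(".,;:")
    for t in TERMS:
        if (t.endswith("*") and w.startswith(t[:-1])) or w == t:
            return t
    return None


def scan(title, lam):
    """Sample at most lam data elements; return the accepted string or None."""
    path = []
    for count, word in enumerate(title.split(), start=1):
        if count > lam:
            return None                 # cut-off reached without acceptance
        t = term_for(word)
        if t is None:
            continue                    # null state, e.g. METHODS, OF
        if path and (path[-1], t) not in ALLOWED:
            return None
        path.append(t)
        if t in FINAL:
            return path
    return None


title_200 = ("Analytical methods of artificial sweeteners. "
             "Determination of sodium cyclamate.")
print(scan(title_200, lam=5))
```

With the cut-off at five the scan stops at SWEETENERS, so DETERMINATION, SODIUM and CYCLAMATE are never examined, which is the behavior described above.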

The second query (or search iteration) represents an attempt, on the
part of the user, to obtain clear support or refutation of the remaining
two hypotheses. However, the user has already obtained some information
about these two hypotheses. The reader should recall that the absence of
data provides both meta-information with respect to the information need and
information with respect to the process of inquiry. Thus, the new query is
modeled by an updated matrix, TRANS' (see Figure 9.3.3), in which two new
terms, taken from the data elements associated with document number 200,
have been added to those of TRANS:

• DETERMINE

• CYCLAMATE*

With $\lambda$ increased to 20 (document # 200 required a sampling of 5 data
elements to be retrieved) the following documents are retrieved and accepted
by the H.S. automaton:

# 200 Analytical methods of artificial sweeteners.
Determination of sodium cyclamate.

# 100 Mechanism of the laxative effect of sodium sulfate,
sodium cyclamate and calcium cyclamate.

# 350 Rapid method for the estimation of impurities in
saccharin and sodium saccharin.

# 50 Peptide synthesis with mixed anhydrides from N-acyl
amino acids and saccharin.

# 39 Distribution and excretion of carbon-14-cyclamate
sodium in animals.

Documents 100, 350 and 39 support the remaining two hypotheses. These
documents, furthermore, would satisfy the information need if the user's
D automaton* provided an explicit association relationship between the
retrieved data elements

MECHANISM, LAXATIVE, IMPURITIES, EXCRETION

and the search terms pharmacology and toxicology. If the D automaton failed
to accept any or all of documents 100, 350 and 39, this would become
clear (to an observer) by the observation of the user effecting a third
search iteration with the system. Although we shall not pursue the example
further, a possible subsequent H.S. modification would include the assignment
of a value of $\lambda$ = 10 (to eliminate document # 50) and the addition of
the new data elements

• PHYSIOLOG*

• IMPURIT*

• EXCRETION*

to the TRANS' matrix.
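The effect of the cut-off on the five retrieved titles can be checked with a simplified sketch. It ignores the transition structure altogether and assumes that any word containing one of three stems (SWEETEN, SACCHARIN, CYCLAMATE) brings the automaton to a final state; that assumption, and the names `FINAL_STEMS`, `sample_size` and `retained`, are introduced only for illustration.

```python
TITLES = {
    200: "Analytical methods of artificial sweeteners. "
         "Determination of sodium cyclamate.",
    100: "Mechanism of the laxative effect of sodium sulfate, "
         "sodium cyclamate and calcium cyclamate.",
    350: "Rapid method for the estimation of impurities in "
         "saccharin and sodium saccharin.",
    50: "Peptide synthesis with mixed anhydrides from N-acyl "
        "amino acids and saccharin.",
    39: "Distribution and excretion of carbon-14-cyclamate "
        "sodium in animals.",
}

# Assumed stems whose occurrence is taken to reach a final state.
FINAL_STEMS = ("SWEETEN", "SACCHARIN", "CYCLAMATE")


def sample_size(title):
    """Number of data elements sampled before a final state is reached."""
    for count, word in enumerate(title.split(), start=1):
        if any(stem in word.upper() for stem in FINAL_STEMS):
            return count
    return None


def retained(lam):
    """Document numbers accepted when sampling stops after lam elements."""
    kept = []
    for number, title in TITLES.items():
        size = sample_size(title)
        if size is not None and size <= lam:
            kept.append(number)
    return sorted(kept)


print(retained(20))
print(retained(10))
```

Under these assumptions document # 50 needs a sample of eleven data elements before SACCHARIN is reached, so lowering the cut-off from 20 to 10 removes it while the other four documents remain.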

10. A Reconsideration of the Concept of Relevance 

"How is bread made?"

"I know that" Alice answered eagerly. "You take some flour--"

"Where do you pick the flower?" The White Queen asked. "In
a garden, or in the hedges?"

"Well, it isn't picked at all," Alice explained. "It's ground--"

"How many acres of ground?" said the White Queen. "You mustn't
leave out so many things."

Lewis Carroll

Although I have, in this chapter, neither specified nor resolved all
possible sources of doubt and have no doubt overlooked many important topics,
I hope the material presented has contributed to a clearer understanding
of the concepts of evaluation and relevance. Relevance assessment has been
described as an integral part of an algorithm that embodies the process of
inquiry which is characteristic of an interactive retrieval interface.
Discussion has so far relied heavily on the acceptance of the concepts of
information need, inquiry, problem solving, hypothesis testing, attribute
identification and, implicitly, data-element relevance. Let me summarize
briefly what has been said about relevance and evaluation in this context.

* The structure of this automaton is analogous to the association table
of related terms mentioned by Kochen et al. [69].

I have accepted as a premise that the problem of IS&R systems evaluation 
is both paramount to effective systems design and a corollary to the 
theoretical considerations that have been developed in the previous chapter. 
The arguments that have been presented have assumed, further, that 
evaluation is best described as a value judgment or formal correlation
between data retrieved by a user and his information need. There remains
the task of quantifying this correlation. All correlational measures, in
IS&R applications, presumably reduce, ultimately, to variations of the
measures of recall and precision, which in turn are based on a user's precise
determination of document relevance. Unfortunately, relevance remains a
subjective, fuzzy concept. I have chosen to attack the problem through a
detailed analysis of the concept of information need. It is believed that
the testing of hypotheses concerning the user's environment (analogous to
the Scientific Method) is consequent upon his information need. It has been
useful to describe the hypothesis/information-need/hypothesis cycle in
terms of problem solving behavior that focuses on the user's identification
of attributes and the employment of optimizing strategies. This conceptualizing
on my part about a user's thinking and approach has led to the development
of a hypothesis structure model which I believe to be descriptive of an
IS&R-system user's data-element acceptance abilities. This model, as we
shall see directly, provides for the necessary quantification of the user's
relevance decision.

10.1 Relevance 

Cooper [70] describes the action of an IS&R system in response to a
query as the establishment of a "ranking among the documents in the
collection." The rank-ordered documents are then examined, one-by-one,
and a decision is made concerning their utility in the satisfaction of the
information need. However, the previous discussion implies that utility
decisions are made upon data elements, not documents. The reader will recall
that the inputs to a hypothesis-structure automaton are strings of data
elements, and the concept, or framework, of a document was treated as purely
coincidental to these inputs. We assume that strings of data elements are
the essential components of the concept of relevance. Two distinct, relevant
strings of data elements are identified:

• Data elements accepted by the H.S. automaton. 

• Data elements accepted by the H.S. automaton and 
matching the user's information need. 

The complexity of the strings of data elements is postulated to range
from a specific datum to complex, prescribed patterns of data elements. To
me, the important observation is the recognition that relevance refers
either to the acceptance or to the matching of data elements. Thus, three
forms of data-element match are identified:

• specific data element value (datum value), e.g.,
27 ± 0.1 feet high

• total data element pattern match (e.g., Patent
Office novelty and anticipation searches)

• general pattern or equivalence class of data
elements (e.g., a genus of organism)



Regardless of the type of match that is required in a search, the probability
that rank-ordered data elements will be relevant (i.e., accepted) is assumed
to depend on the degree to which the user's query (hypothesis structure) can
be embedded in the IS&R data structure. Embedding is interpreted as
structural similarity. The extent of the similarity between a query and the
system's data base is measured in terms of the degree of overlap of the
syntax, data elements and relationships present in both the user's H.S. and
the IS&R system.
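No formula for the degree of overlap is given in the text. One simple way to picture it, offered here purely as an assumption, is the fraction of the user's data elements and relationships that also occur in the system's data structure; the function name `embedding_degree` and the sample sets are hypothetical.

```python
def embedding_degree(user_structure, system_structure):
    """Fraction of the user's H.S. items that are present in the system.

    Both arguments are collections of hashable items, for example data
    elements ("SACCHARIN*") and relationships (("ANALY*", "SACCHARIN*")).
    """
    user = set(user_structure)
    if not user:
        return 0.0                      # nothing to embed (assumed convention)
    return len(user & set(system_structure)) / len(user)


user_hs = {"ANALY*", "SACCHARIN*", ("ANALY*", "SACCHARIN*"), "POISON"}
system = {"ANALY*", "SACCHARIN*", ("ANALY*", "SACCHARIN*"), "CYCLAMATE*"}
print(embedding_degree(user_hs, system))
```

A value of 1.0 would mean the hypothesis structure is fully embedded in the system's data structure; lower values indicate elements or relationships the user expects but the system does not hold.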

The two forms of relevance are shown in Figure 10.1.1, which itself is 
a depiction of the various processes that, potentially, are utilized in an 
interactive retrieval environment. The correlation between the query and the 
retrieved data elements is not shown because it is assumed always to be 
perfect. In any event, the two forms of relevance are believed to be an 
integral part of a generalized inquiry algorithm. In such a context, relevance 
is a measure of the precision of the measurement and perception employed by
the algorithm. Various forms of such an algorithm can be listed by increasing
precision:

• Scientific Method

• Process of Inquiry

• Problem Solving

• Hypothesis Testing

• Attribute Identification

Even though the algorithms associated with the first two levels of precision 
may be esoteric, they are still accountable in terms of the aforementioned 
hypothesis structure model. 

Interesting support for the particular view of relevance which is
presented in this Chapter can be found in the results of experiments 8 and
12 (analysis of precision and recall failures) of the recent United Kingdom
Chemical Information Science (UKCIS) report [71]. Briefly, the most
significant factors influencing "precision" failures were found to be [72]:

• items retrieved by terms in the wrong context 

• wrong correlation of terms 

These conclusions indicate an observer-deduced failure in the user's
hypothesis structure configuration, i.e., the user's initial hypothesis about
the form of the data base is interpreted by an outside observer as incorrect.
Corrective search iterations are rec[...] hypotheses (updating of the
transition values in the hypothesis structure matrix, TRANS). The UKCIS study
also identified the following factors, which were inferred to influence the
user failures:

• profile narrower than interest (concepts 

• inadequate concept expansion 

• tc>\> or 

• input error 

These results support the assertion made in Section 8 of this chapter: as
viewed by an outside observer, users generally fail to identify and use
pertinent system attributes. Part of this problem is believed to stem from
the user's inability to perceive the system's data structure and is reflected
in his incorrect hypothesis structure. Attribute identification is no
small problem, since most systems emphasize the importance of a "middle man"
who identifies the "relevant" system attributes for the user. It becomes
apparent that either user problem-solving abilities need to be improved, or
IS&R systems must match their operation to classes of problem-solving
skills manifested by their users.

10.2 Evaluation 

Since the quantification of relevance reduces to the identification and
description of strings of data elements accepted by the H.S. automaton, the
problem of IS&R systems evaluation is resolved through an analysis of the
variables associated with the hypothesis structure model. Six major variables
are identified:

• User's system confidence value as indicated by his choice of the cut-off
value, $\lambda$. The user's view of the reliability
of the system is seen in the inverse relationship that
exists between $\lambda$ and confidence. A high $\lambda$-value implies
a potentially lengthy sampling and a low confidence in
the system's rank ordering.

• The number of hypotheses tested.

• The time between iterations.

• The time devoted to problem solving.

• The number of decisions made.

• The number of redundant data elements that have to be
examined before a pattern can be perceived. This amounts
to a paraphrasing of Cooper's Average Search Length [73]:

The number of irrelevant data elements one would expect to
search through before finding as many relevant data elements
as needed.
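The last variable can be made concrete with a small sketch. It assumes a strict ranking in which each retrieved data element has already been judged relevant or not, and counts the irrelevant elements passed before the required number of relevant ones is found; ties in the ranking, which Cooper's measure also has to deal with, are ignored, and the function name `search_length` is hypothetical.

```python
def search_length(ranked_relevance, needed):
    """Irrelevant elements examined before `needed` relevant ones are found.

    ranked_relevance: booleans in rank order (True = relevant data element)
    needed: number of relevant data elements the user requires
    Returns None if the ranking does not contain enough relevant elements.
    """
    irrelevant = 0
    found = 0
    for is_relevant in ranked_relevance:
        if is_relevant:
            found += 1
            if found == needed:
                return irrelevant
        else:
            irrelevant += 1
    return None


ranking = [False, True, False, False, True]
print(search_length(ranking, needed=1))
print(search_length(ranking, needed=2))
```

Averaging this count over a well-defined class of users and queries would give one of the observer-level measures discussed in this section.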

Measures of IS&R system evaluation, by implication, should not be averaged
[...] above) may be employed by an outside observer to describe system
performance with respect to well-defined classes of users (this implies, also,
the possibility of inducing well-defined classes of queries). Sackman touches
on this problem [74]:

An empirically derived taxonomy of man-computer tasks should
be developed based on demonstrated differences in human-problem
solving style rather than on a confusing welter of strictly
descriptive characteristics.

The notion "classes of users" is posited to be identical to the
notion of the Bruner strategies of problem solving discussed in
Section 8. Thus, a system's performance may be evaluated on its ability to
satisfy the information needs of well-defined classes of problem-solving
users. The establishment of such "well-defined classes" must be a prerequisite
to evaluation.

In summary, I have attempted to provide in this chapter a workable 
definition of relevance, which is consistent with theoretical indexing and 
human behavioral considerations. Possibly the most significant feature of 
the previous discussion about relevance is its attempt to provide an
integrated, cross-disciplinary approach to relevance assessment. The retrieval
interface is not studied in vacuo, but as a complex symbiotic process. A
model of retrieval problem solving has been developed that is assumed to be
amenable to quantification. This quantification is achieved through a
definition of the H.S. and D automata and, of course, the concept of the data
element. From this point of view, evaluation has been presented as a measure
defined over classes of users and their queries. Finally, it should be
emphasized that prior imprecision in reference to the term "relevance" has been 
reduced to the subjective selection by the outside observer of transition 
values in the H.S. automaton.

VI. SUMMARY AND RESEARCH DIRECTIONS

The reason why there is no Table or Index added hereunto is, that
every page is so full of signal remarks that were they couched in 
an Index it would make a volume as big as the book, and so make 
the Postern Gate to bear no proportion to the building. 

Howell 

The material contained in this brief concluding chapter is presented in 
three sections. First, a summary of the previous five chapters is presented. 
This is followed, secondly, by a discussion of the several ways in which index- 
ing theory models information storage and retrieval. Finally, directions for 
future research in indexing theory are outlined. 

1. Summary

We began this inquiry into indexing theory with a brief consideration of 
some of the causes of the phenomenon commonly referred to as the "information
explosion." It was concluded that present day shortcomings in information
retrieval are the result of a failure to properly contend with the problem of
data representation. Science does not suffer from a lack of accumulated
knowledge but, rather, it suffers from the inability to efficiently communicate
what has been previously discovered. As a consequence of the difficulty of 
data communication within and between the Sciences, there has been a growth 
in the number of specialized areas of investigation. In many respects this 
proliferation of specialties has only served to further hamper effective 
data and information retrieval. 

Information storage and retrieval was initially characterized as a 
communication interface between the data of the Sciences and a diverse 
population of users. The object of any interaction with IS&R systems is to
develop a high level of shared agreement or common understanding between the
storage scheme of the system and both the information need and the resulting 
search techniques employed by the user. It was postulated that the
effectiveness of this interaction was, primarily, dependent on the fidelity of
document representation. Furthermore, it was assumed that the indexing 
operation was a prime exemplar of the process of document representation. 

The field of IS&R suffers from the absence of a unifying theory of 
document (data) transfer. Consequently, one is embarrassed by the difficulty 
of effectively evaluating the many IS&R systems that are currently in 
operation (and the many more that are still in the planning stages). It was 
concluded that the primary goal of theory development is the creation of
sound evaluation measures. Furthermore, it was argued that a theory of 
indexing could serve as an adequate model for many of the processes of 
information storage and retrieval. If properly developed, this theory would 
provide the basis for the systematic analysis of both indexing procedures 
and of resultant indexes, and it would provide the conceptual basis for the 
development of evaluation techniques. Unfortunately, initial investigation 
is hampered by a long history of considering indexing as an "artful" practice.
Thus, a theory of indexing must first turn to a consideration of the following
fundamental questions: why index at all; what should be indexed; what
is the role of indexing "aids" in the process of indexing; and how are indexes
to be evaluated?

Previous indexing theories can be faulted for not providing answers to 
these fundamental questions. Two related theories were reviewed, mainly for 
the purpose of building in the reader an appreciation of their general tone
and direction of approach. Heilprin's theory raised several interesting
points which have been pursued in the current work toward a theory of
indexing. These include the conceptualization of indexing systems as closed
systems; the importance of the effect of noise in the indexing operation;
the concept of a search path and, finally, the use of an indexing region as
a means of system characterization.

Chapter four presented the basis for a comprehensive theory of information 
storage and retrieval. The fundamental thesis was that this theory had its 
genesis in a theory of the indexing process. In other words, as has been 
previously emphasized, it is believed that the success of an IS&R system 
depends primarily on accurate and complete document representation, and that 
such document representation is the goal of any indexing process. It was 
contended that the index provides the necessary linkage between a multiplicity 
of sources and a single receiver. Conceptually, the indexing system is 
initially viewed as a black box that accepts documents as its inputs and 
produces the index as its only product (output). Various sources produce the
documents which become the elements of the document space and receivers 
produce queries which are matched against the index and, eventually, against 
the document store. Whether considering the source/document-space interface 
or the query/index interface, the elements of the underlying communication
phenomena are the same: sets of documents, sets of attributes and sets of
relations expressing a connection between documents and attributes. I have 
chosen to represent these attributes by the concept of the data element. 
Following the progression of schema presented in Figure 10.1 of Chapter IV, 
we first considered the necessary criteria for effective communication and
concluded that the index provided the requisite common experience set between
the source and the receiver. We then more precisely positioned the indexing
system intermediary between the communication channel and the receiver
(searcher) and emphasized the role of "noise" and feedback. Following a
specification of the "position" of the black box or indexing system in
communication, we considered a theory of its operation. This theory, called
the indexing process, defined the essential operation of the indexing system
to be the creation of a representation of the document space. The
analysis-document transformations and the final index-query transformations
were shown to be, respectively, a prerequisite to, and a function of, the
document space representation. Examples of these transformations were provided
through the analysis of a sample document. Finally, the operating
characteristics of the indexing system were modeled by means of the index
space. From a different point of view, the concepts of error, organization,
information and search were introduced through a consideration of the indexing
process as a thermodynamic system. Thus, indexing was viewed as an
order-increasing operation that identifies common data elements and relations
between data elements present in the input document stream. The existence of
both the "perfect" indexing system and the theoretical index were then
postulated and compared with their real-world counterparts. Several
suggestions for real-world indexing improvements (with the idea of emulating
the theoretical index) were presented and, finally, it was argued that the
value of each newly retrieved data element was a function of the order of
retrieval.

In Chapter V attention was directed toward applying the current indexing 
theory to the problem of IS&R systems evaluation. It was postulated that
successful systems evaluation is based on an understanding of why certain
retrieved documents are judged to be relevant to the searcher. Specifically,
I wanted to know more about data element relevance. Unfortunately, the
concept of relevance (and evaluation, for that matter) has had a rather 
confused history in the field of IS&R. Thus, I have chosen to investigate 
the nature of data element relevance by means of a reconsideration of the 
concept of information need. It was argued that an information need resulted 
from two alternative forms of hypothesis testing. Consequently, the process 
of satisfying an information need involves the utilization of problem solving 
strategies and selective decision-making criteria. Following a discussion 
concerning the processing of attributes, the results of a brief experimental 
investigation were presented that indicated that the success of problem 
solving strategies, and of hypothesis testing, was directly related to the 
level of structure associated with the problem solving setting. Finally, a 
hypothesis structure automaton was presented as a model of how a searcher 
evaluates the relevance (acceptability) of retrieved strings of data elements.

2. Indexing Theory as a Model of IS&R 

In Chapter II we broadly characterized information storage and retrieval 
as serving to provide the basis for document data— element representation 
and searching. More specifically, the diverse operations of document 
acquisition, data-element representation, document storage, query preparation,
data-element searching, document retrieval and retrieval evaluation were 
singled out, in Figure 2.1 of Chapter II, as prerequisites for successful 
information retrieval. That figure also showed the document representation 
and storage chain merging with the query/information-need processing chain
at the operation of the data-base search. One of the major conclusions to
be drawn from the presentation in Chapter IV is that the indexing operation
(data-element/document representation) is the controlling factor in the
success of the search operation. That is to say, efficient storage and 
search algorithms are meaningless, with respect to satisfying the information 
need, if an accurate and complete document representation, indexing,
is not provided initially.

Thus, we have modeled IS&R with a theory of indexing which amounts to a
theory of IS&R's most crucial operation. Like the characterization of the
index, the theory is itself bi-directional in nature. The first part of the
theory has dealt with the problem of the representation of data elements in 
a multiplicity of documents; the second part of the theory has been concerned 
with the user viewing the index as a tool for the resolution of an information 
need. It was concluded that one must speak about the use of the index when 
discussing the theory of its construction since an index is surely created 
to be used. Furthermore, it was concluded that the manner of index construction
(and the form of the resultant index) specifies the class of queries that
are acceptable to the retrieval system. Thus, a theory of index construction 
is, implicitly, a theory about index search and, consequently, a theory of 
information retrieval. 

Both portions of the theory are amenable to evaluation. First, indexing 
viewed as a process for the representation of data elements and relations 
between data elements was modeled by a set of transformations which are 
applied by the indexing system to the input documents to create the index
as an end product. The effectiveness of the transformations, and the
specificity of the index (as a representation of the input documents), are
evaluated by comparing the resultant index with the theoretical index.
Second, indexing viewed as the creation of a search interface places emphasis
on the association structure created by the indexing process. We have
postulated that the utility of the index can be measured following an
understanding of how people go about using the index. Consequently, the evaluation
of whether a system is able to provide an acceptable string of data elements 
is obtained through the observation of the rate of convergence of user 
decisions and hypotheses toward the satisfaction of the information need. 
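
The first of these evaluations, the comparison of a resultant index with
the theoretical index, can be illustrated with a minimal sketch. It assumes
that each index is reduced to a set of (document, attribute) relations and
that agreement is summarized by two simple ratios; the function name, the
two ratios and the sample data are hypothetical and are not measures
defined in this dissertation.

```python
def compare_indexes(resultant, theoretical):
    """Compare two indexes given as sets of (document, attribute) relations.

    Assumed measures (illustrative only):
      completeness: share of theoretical relations present in the resultant index
      accuracy:     share of resultant relations present in the theoretical index
    """
    shared = resultant & theoretical
    completeness = len(shared) / len(theoretical) if theoretical else 1.0
    accuracy = len(shared) / len(resultant) if resultant else 1.0
    return completeness, accuracy


# Hypothetical sample data: relations a theoretical index would contain,
# and relations actually produced by some operational indexing system.
theoretical = {
    ("d1", "indexing"), ("d1", "retrieval"),
    ("d2", "indexing"), ("d2", "grammar"),
}
resultant = {("d1", "indexing"), ("d2", "indexing"), ("d2", "storage")}

completeness, accuracy = compare_indexes(resultant, theoretical)
print(completeness, accuracy)
```

Under these assumptions, a resultant index identical to the theoretical
index scores 1.0 on both ratios; relations discarded by the indexing process
lower the first ratio, and spurious relations lower the second.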

3. Directions for Future Research

'But I should like to know...' Pippin began. The information presented
in this dissertation has only begun to satisfy this author’s inquisitiveness 
about the processes of indexing and information storage and retrieval. As a 
consequence of this investigation it is possible to identify three separate, 
but conceptually related, directions for future research. 

• Further studies in the theoretical representation of indexing. 

It has been beneficial, from a theoretical viewpoint, to characterize the 
operations of the indexing system both by means of generalized transformations 
and by means of the index space. The next step is to develop these
transformations and representations for specific operational indexing systems. It
is hypothesized that such a detailed analysis will indicate the degree of
ordering effected by alternative indexing systems. Also, it is believed that
such a detailed analysis will show what types of data elements are preserved 
or discarded in the specific indexing process. 

• The index as the search interface.

Further studies, following the lines of the extended Bruner experiment, 
should be undertaken to determine how users go about the identification of 
attributes in a retrieval setting. It is hypothesized that an understanding 
of how users differ with respect to the identification and the utilization 
of attributes (and structure) will be helpful in the design of improved 
indexes. From the point of view of evaluation, the hypothesis structure
automaton suggests the development of a simulation model for predicting the
retrieval behavior (and information-need satisfaction) of classes of users
under varying retrieval requirements. 

• The Case Grammar Index. 

It is believed that this approach to the analysis of document content 
will yield an index that accurately represents data elements and relations 
between data elements. Such an assertion will have to be tested by the
indexing of a sample document space. 

This author hopes to have the opportunity of continuing work along the 
lines indicated above, and it is expected that various aspects of this 
research will be continued in these Laboratories. 